Re: RFC: drop the T10 OSD code and its users

2017-04-13 Thread Martin K. Petersen

Christoph,

> The only real user of the T10 OSD protocol, the pNFS object layout
> driver never went to the point of having shipping products, and the
> other two users (osdblk and exofs) were simple example of it's usage.
>
> The code has been mostly unmaintained for years and is getting in the
> way of block / SCSI changes, so I think it's finally time to drop it.

No particular objections from me.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: remove REQ_OP_WRITE_SAME

2017-04-13 Thread Martin K. Petersen

Christoph,

> Now that we are using REQ_OP_WRITE_ZEROES for all zeroing needs in the
> kernel there is very little use left for REQ_OP_WRITE_SAME.  We only
> have two callers left, and both just export optional protocol features
> to remote systems: DRBD and the target code.

While I'm not particularly married to WRITE SAME, I do think it's a
shame that the RAID5/6 code never started using it. It does make a
difference when initializing RAID sets.

The other thing that keeps me a bit on the fence is that a bunch of the
plumbing to handle a bio with a payload different from bi_size is needed
for the copy offload token. I'm hoping to have those patches ready for
4.13. Right now there are a bunch of places where handling of
REQ_OP_COPY_IN and REQ_OP_COPY_OUT share conditionals with WRITE SAME.

So I suggest postponing the decision about whether to rip out WRITE SAME
until I finish the token stuff.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] block: fix bio_will_gap()

2017-04-13 Thread Ming Lei
On Thu, Apr 13, 2017 at 04:22:22PM +, Bart Van Assche wrote:
> On Fri, 2017-04-14 at 00:06 +0800, Ming Lei wrote:
> > But if one rq starts with non-aligned buffer(the 1st bvec's
> > bv_offset isn't zero) and if we allow to merge, it is quite
> > difficult to respect sg gap limit, especially the segment
> > can't be at maximum segment size, otherwise the segment
> > ends in unaligned virt boundary. This patch trys to avoid the
> > issue by not allowing to merge if the req starts with non-aligned
> > buffer.
> 
> Hello Ming,
> 
> Why is it considered difficult to detect whether or not a gap exists if
> the offset of the first bio is non-zero? Please note that a thoroughly
> tested version of gap detection code that supports non-zero offsets for
> the first element is already upstream. See also ib_map_mr_sg() and
> ib_sg_to_pages() in drivers/infiniband/core/verbs.c.

I don't know infiniband much, but from the code, it is just about
sg, and looks not same with block's sg gap limit.

The sg gap limit in block layer actually means:
- except for last segment, every segment has to end in aligned virt 
boundary
- from the 2nd segment, offset of the segment has to be zero

But we don't store main segment info(start, size) in bio/req,
all the segments are basically computed in the flight, so it isn't
easy to check/handle sg gap. Also IMO segment handling of block layer is
quite fragile, and hope multi-page bvec can improve the case much.

We relax check for sg gap for allowing merging tons of small bios, if
these bios are physically contiguous. But if offset of the 1st bio
in the merged req isn't zero, current segment compution in
__blk_segment_map_sg() doesn't consider sg gap, so one segment may
ends in unaligned virt boundary.

Given it is late in v4.11 cycle, this patch doesn't want to touch
__blk_segment_map_sg() for avoiding regression risk, instead just
changing bio_will_gap() for avoding the case with minimized risk,
because the change is quite simple.


Thanks,
Ming


Re: [PATCH] block: bios with an offset are always gappy

2017-04-13 Thread Ming Lei
On Thu, Apr 13, 2017 at 10:35:17PM +0200, Andreas Mohr wrote:
> On Thu, Apr 13, 2017 at 10:45:10PM +0800, Ming Lei wrote:
> > +   /*
> > +* don't merge if the 1st bio starts with non-zero
> > +* offset, otherwise it is quite difficult to respect
> > +* sg gap limit. We work hard to merge huge of small
> > +* bios in case of mkfs.
> 
> "huge of small bios in case" - -ENOPARSE.
> 
> "huge numbers of"?
> "huge or small"?

OK, it should be tons of or sort of description. As you can see
in the trace, there are 2560 bios merged in one request.

Thanks,
Ming


Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically

2017-04-13 Thread Ming Lei
On Thu, Apr 13, 2017 at 09:59:57AM -0700, Bart Van Assche wrote:
> On 04/12/17 19:20, Ming Lei wrote:
> > On Wed, Apr 12, 2017 at 06:38:07PM +, Bart Van Assche wrote:
> >> If the blk-mq core would always rerun a hardware queue if a block driver
> >> returns BLK_MQ_RQ_QUEUE_BUSY then that would cause 100% of a single CPU 
> >> core
> > 
> > It won't casue 100% CPU utilization since we restart queue in completion
> > path and at that time at least one tag is available, then progress can be
> > made.
> 
> Hello Ming,
> 
> Sorry but you are wrong. If .queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY
> then it's likely that calling .queue_rq() again after only a few
> microseconds will cause it to return BLK_MQ_RQ_QUEUE_BUSY again. If you
> don't believe me, change "if (!blk_mq_sched_needs_restart(hctx) &&
> !test_bit(BLK_MQ_S_TAG_WAITING, >state)) blk_mq_run_hw_queue(hctx,
> true);" into "blk_mq_run_hw_queue(hctx, true);", trigger a busy

Yes, that can be true, but I mean it is still OK to run the queue again
with

if (!blk_mq_sched_needs_restart(hctx) &&
!test_bit(BLK_MQ_S_TAG_WAITING, >state))
blk_mq_run_hw_queue(hctx, true);

and restarting queue in __blk_mq_finish_request() when
BLK_MQ_RQ_QUEUE_BUSY is returned from .queue_rq(). And both are in current
blk-mq implementation.

Then why do we need blk_mq_delay_run_hw_queue(hctx, 100/*ms*/) in dm?

Thanks,
Ming


Re: [PATCH 5/6] blk-mq: Add blk_mq_ops.show_rq()

2017-04-13 Thread Omar Sandoval
On Tue, Apr 11, 2017 at 01:58:41PM -0700, Bart Van Assche wrote:
> This new callback function will be used in the next patch to show
> more information about SCSI requests.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Omar Sandoval 
> Cc: Hannes Reinecke 
> ---
>  block/blk-mq-debugfs.c | 10 --
>  include/linux/blk-mq.h |  6 ++
>  2 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 161f30fc236f..6b28d92d4c0e 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -316,7 +316,9 @@ static const char *const rqf_name[] = {
>  static int blk_mq_debugfs_rq_show(struct seq_file *m, void *v)
>  {
>   struct request *rq = list_entry_rq(v);
> + const struct blk_mq_ops *const mq_ops = rq->q->mq_ops;
>   const unsigned int op = rq->cmd_flags & REQ_OP_MASK;
> + char drv_info[200];
>  
>   seq_printf(m, "%p {.op=", rq);
>   if (op < ARRAY_SIZE(op_name) && op_name[op])
> @@ -329,8 +331,12 @@ static int blk_mq_debugfs_rq_show(struct seq_file *m, 
> void *v)
>   seq_puts(m, ", .rq_flags=");
>   blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
>  ARRAY_SIZE(rqf_name));
> - seq_printf(m, ", .tag=%d, .internal_tag=%d}\n", rq->tag,
> -rq->internal_tag);
> + if (mq_ops->show_rq)
> + mq_ops->show_rq(rq, drv_info, sizeof(drv_info));

How about passing the seq_file to the callback instead of this
arbitrarily-sized on-stack buffer?


Re: [PATCH 4/6] blk-mq: Show operation, cmd_flags and rq_flags names

2017-04-13 Thread Omar Sandoval
On Tue, Apr 11, 2017 at 01:58:40PM -0700, Bart Van Assche wrote:
> Show the operation name, .cmd_flags and .rq_flags as names instead
> of numbers.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Omar Sandoval 
> Cc: Hannes Reinecke 

I like this :) one minor nit below.

Reviewed-by: Omar Sandoval 

> ---
>  block/blk-mq-debugfs.c | 72 
> +++---
>  1 file changed, 69 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index aae4b7c7b7b0..161f30fc236f 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
[snip]
>  static int blk_mq_debugfs_rq_show(struct seq_file *m, void *v)
>  {
>   struct request *rq = list_entry_rq(v);
> + const unsigned int op = rq->cmd_flags & REQ_OP_MASK;
>  
> - seq_printf(m, "%p {.cmd_flags=0x%x, .rq_flags=0x%x, .tag=%d, 
> .internal_tag=%d}\n",
> -rq, rq->cmd_flags, (__force unsigned int)rq->rq_flags,
> -rq->tag, rq->internal_tag);
> + seq_printf(m, "%p {.op=", rq);
> + if (op < ARRAY_SIZE(op_name) && op_name[op])
> + seq_printf(m, "%s", op_name[op]);
> + else
> + seq_printf(m, "%d", op);
> + seq_puts(m, ", .cmd_flags=");
> + blk_flags_show(m, rq->cmd_flags ^ op, cmd_flag_name,
  ^^
I think rq->cmd_flags & ~REQ_OP_MASK is slightly clearer here, but I
don't feel that strongly about it, it's up to you.

> +ARRAY_SIZE(cmd_flag_name));
> + seq_puts(m, ", .rq_flags=");
> + blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
> +ARRAY_SIZE(rqf_name));
> + seq_printf(m, ", .tag=%d, .internal_tag=%d}\n", rq->tag,
> +rq->internal_tag);
>   return 0;
>  }
>  
> -- 
> 2.12.0
> 


RE: [PATCH] block-mq: set both block queue and hardware queue restart bit for restart

2017-04-13 Thread Long Li


> -Original Message-
> From: Bart Van Assche [mailto:bart.vanass...@sandisk.com]
> Sent: Monday, April 10, 2017 4:48 PM
> To: linux-ker...@vger.kernel.org; linux-block@vger.kernel.org; KY Srinivasan
> ; Long Li ; ax...@kernel.dk
> Cc: Stephen Hemminger 
> Subject: Re: [PATCH] block-mq: set both block queue and hardware queue
> restart bit for restart
> 
> On Thu, 2017-04-06 at 04:21 +, KY Srinivasan wrote:
> > > -Original Message-
> > > From: Bart Van Assche [mailto:bart.vanass...@sandisk.com]
> > > Sent: Wednesday, April 5, 2017 8:46 PM
> > > To: linux-ker...@vger.kernel.org; linux-block@vger.kernel.org; Long
> > > Li ; ax...@kernel.dk
> > > Cc: Stephen Hemminger ; KY Srinivasan
> > > 
> > > Subject: Re: [PATCH] block-mq: set both block queue and hardware
> > > queue restart bit for restart
> > >
> > > On Thu, 2017-04-06 at 03:38 +, Long Li wrote:
> > > > > -Original Message-
> > > > > From: Bart Van Assche [mailto:bart.vanass...@sandisk.com]
> > > > >
> > > > > Please drop this patch. I'm working on a better solution.
> > > >
> > > > Thank you. Looking forward to your patch.
> > >
> > > Hello Long,
> > >
> > > It would help if you could share the name of the block or SCSI
> > > driver with which you ran into that lockup and also if you could
> > > share the name of the I/O scheduler used in your test.
> >
> > The tests that indicated the issue were run Hyper-V. The driver is
> > storvsc_drv.c The I/O scheduler was I think noop.
> 
> Hello Long and K.Y.,
> 
> Thank you for the feedback. Can you repeat your test with kernel v4.11-rc6?
> The patches that went into the block layer for v4.11-rc6 should be sufficient
> to fix
> this:
> 
> $ PAGER= git log --format=short v4.11-rc5..v4.11-rc6 block include/linux/blk*
> commit 6d8c6c0f97ad8a3517c42b179c1dc8e77397d0e2
> Author: Bart Van Assche 
> 
> blk-mq: Restart a single queue if tag sets are shared
> 
> commit 7587a5ae7eef0439f7be31f1b5959af062bbc5ec
> Author: Bart Van Assche 
> 
> blk-mq: Introduce blk_mq_delay_run_hw_queue()
> 
> commit ebe8bddb6e30d7a02775b9972099271fc5910f37
> Author: Omar Sandoval 
> 
> blk-mq: remap queues when adding/removing hardware queues
> 
> commit 54d5329d425650fafaf90660a139c771d2d49cae
> Author: Omar Sandoval 
> 
> blk-mq-sched: fix crash in switch error path
> 
> commit 93252632e828da3e90241a1c0e766556abf71598
> Author: Omar Sandoval 
> 
> blk-mq-sched: set up scheduler tags when bringing up new queues
> 
> commit 6917ff0b5bd4139e08a3f3146529dcb3b95ba7a6
> Author: Omar Sandoval 
> 
> blk-mq-sched: refactor scheduler initialization
> 
> commit 81380ca10778b99dce98940cfc993214712df335
> Author: Omar Sandoval 
> 
> blk-mq: use the right hctx when getting a driver tag fails
> 
> commit ac77a0c463c1d7d659861f7b6d1261970dd3282a
> Author: Minchan Kim 
> 
> block: do not put mq context in blk_mq_alloc_request_hctx
> 
> Bart.

Thank you. We are doing tests. I will update on the results.

Long


Re: [PATCH 3/6] blk-mq: Make blk_flags_show() callers append a newline character

2017-04-13 Thread Omar Sandoval
On Tue, Apr 11, 2017 at 01:58:39PM -0700, Bart Van Assche wrote:
> This patch does not change any functionality but makes it possible
> to produce a single line of output with multiple flag-to-name
> translations.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Omar Sandoval 
> Cc: Hannes Reinecke 

Reviewed-by: Omar Sandoval 


Re: [PATCH 1/6] blk-mq: Do not invoke queue operations on a dead queue

2017-04-13 Thread Bart Van Assche
On Thu, 2017-04-13 at 16:01 -0700, Omar Sandoval wrote:
> On Tue, Apr 11, 2017 at 01:58:37PM -0700, Bart Van Assche wrote:
> > The blk-mq debugfs attributes are removed after blk_cleanup_queue()
> > has finished. Since running a queue after a queue has entered the
> > "dead" state is not allowed, disallow this. This patch avoids that
> > an attempt to run a dead queue triggers a kernel crash.
> > 
> > Signed-off-by: Bart Van Assche 
> > Cc: Omar Sandoval 
> > Cc: Hannes Reinecke 
> > ---
> >  block/blk-mq-debugfs.c | 8 
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > index df9b688b877c..a1ce823578c7 100644
> > --- a/block/blk-mq-debugfs.c
> > +++ b/block/blk-mq-debugfs.c
> > @@ -111,6 +111,14 @@ static ssize_t blk_queue_flags_store(struct file 
> > *file, const char __user *ubuf,
> > struct request_queue *q = file_inode(file)->i_private;
> > char op[16] = { }, *s;
> >  
> > +   /*
> > +* The debugfs attributes are removed after blk_cleanup_queue() has
> > +* called blk_mq_free_queue(). Return if QUEUE_FLAG_DEAD has been set
> > +* to avoid triggering a use-after-free.
> > +*/
> > +   if (blk_queue_dead(q))
> > +   return -ENOENT;
> > +
> > len = min(len, sizeof(op) - 1);
> > if (copy_from_user(op, ubuf, len))
> > return -EFAULT;
> 
> Looking at this, I think we have similar issues with most of the other
> debugfs files. Should we move the debugfs cleanup earlier?

Hello Omar,

That's a good question. However, while I was debugging it was very convenient
to be able to access the queue state after it had reached the "dead" state.
Performing the cleanup earlier would be an alternative solution but would
make debugging a bit harder ...

Bart.

Re: [PATCH 2/6] blk-mq: Move the "state" debugfs attribute one level down

2017-04-13 Thread Omar Sandoval
On Tue, Apr 11, 2017 at 01:58:38PM -0700, Bart Van Assche wrote:
> Move the "state" attribute from the top level to the "mq" directory
> as requested by Omar.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Omar Sandoval 
> Cc: Hannes Reinecke 

Thanks!

Reviewed-by: Omar Sandoval 

> ---
>  block/blk-mq-debugfs.c | 9 +
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index a1ce823578c7..564470d4af52 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -149,11 +149,6 @@ static const struct file_operations blk_queue_flags_fops 
> = {
>   .write  = blk_queue_flags_store,
>  };
>  
> -static const struct blk_mq_debugfs_attr blk_queue_attrs[] = {
> - {"state", 0600, _queue_flags_fops},
> - {},
> -};
> -
>  static void print_stat(struct seq_file *m, struct blk_rq_stat *stat)
>  {
>   if (stat->nr_samples) {
> @@ -762,6 +757,7 @@ static const struct file_operations ctx_completed_fops = {
>  
>  static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
>   {"poll_stat", 0400, _poll_stat_fops},
> + {"state", 0600, _queue_flags_fops},
>   {},
>  };
>  
> @@ -877,9 +873,6 @@ int blk_mq_debugfs_register_hctxs(struct request_queue *q)
>   if (!q->debugfs_dir)
>   return -ENOENT;
>  
> - if (!debugfs_create_files(q->debugfs_dir, q, blk_queue_attrs))
> - goto err;
> -
>   q->mq_debugfs_dir = debugfs_create_dir("mq", q->debugfs_dir);
>   if (!q->mq_debugfs_dir)
>   goto err;
> -- 
> 2.12.0
> 


Re: [PATCH 1/6] blk-mq: Do not invoke queue operations on a dead queue

2017-04-13 Thread Omar Sandoval
On Tue, Apr 11, 2017 at 01:58:37PM -0700, Bart Van Assche wrote:
> The blk-mq debugfs attributes are removed after blk_cleanup_queue()
> has finished. Since running a queue after a queue has entered the
> "dead" state is not allowed, disallow this. This patch avoids that
> an attempt to run a dead queue triggers a kernel crash.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Omar Sandoval 
> Cc: Hannes Reinecke 
> ---
>  block/blk-mq-debugfs.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index df9b688b877c..a1ce823578c7 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -111,6 +111,14 @@ static ssize_t blk_queue_flags_store(struct file *file, 
> const char __user *ubuf,
>   struct request_queue *q = file_inode(file)->i_private;
>   char op[16] = { }, *s;
>  
> + /*
> +  * The debugfs attributes are removed after blk_cleanup_queue() has
> +  * called blk_mq_free_queue(). Return if QUEUE_FLAG_DEAD has been set
> +  * to avoid triggering a use-after-free.
> +  */
> + if (blk_queue_dead(q))
> + return -ENOENT;
> +
>   len = min(len, sizeof(op) - 1);
>   if (copy_from_user(op, ubuf, len))
>   return -EFAULT;
> -- 
> 2.12.0
> 

Hi, Bart,

Looking at this, I think we have similar issues with most of the other
debugfs files. Should we move the debugfs cleanup earlier?


Re: [PATCH] block: bios with an offset are always gappy

2017-04-13 Thread Andreas Mohr
On Thu, Apr 13, 2017 at 10:45:10PM +0800, Ming Lei wrote:
> + /*
> +  * don't merge if the 1st bio starts with non-zero
> +  * offset, otherwise it is quite difficult to respect
> +  * sg gap limit. We work hard to merge huge of small
> +  * bios in case of mkfs.

"huge of small bios in case" - -ENOPARSE.

"huge numbers of"?
"huge or small"?

HTH,

Andreas Mohr


Re: [PATCH 08/25] scsi: fix fast-fail for non-passthrough requests

2017-04-13 Thread Bart Van Assche
On Thu, 2017-04-06 at 17:39 +0200, Christoph Hellwig wrote:
> Currently error is always 0 for non-passthrough requests when reaching the
> scsi_noretry_cmd check in scsi_io_completion, which effectively disables
> all fastfail logic.  Fix this by having a single call to
> __scsi_error_from_host_byte at the beginning of the function and always
> having a valid error value.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/scsi_lib.c | 28 +++-
>  1 file changed, 7 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 11972d1075f1..89b4d9e69866 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -779,21 +779,17 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned 
> int good_bytes)
>   sense_valid = scsi_command_normalize_sense(cmd, );
>   if (sense_valid)
>   sense_deferred = scsi_sense_is_deferred();
> +
> + if (!sense_deferred)
> + error = __scsi_error_from_host_byte(cmd, result);
>   }

Hello Christoph,

Sorry but this doesn't look correct to me. Further down a "error =
__scsi_error_from_host_byte(cmd, result)" statement is removed that does not
depend on "if (!sense_deferred)" so I think that assignment should happen
independent of the value of "sense_deferred". Additionally, how can it make
sense to call __scsi_error_from_host_byte() only if sense_deferred == false?
As you know the SCSI command result is generated by the LLD so I don't think
that it depends on whether or not the sense data has been deferred.

Bart.

Re: [PATCH 06/25] virtio: fix spelling of virtblk_scsi_request_done

2017-04-13 Thread Bart Van Assche
On Thu, 2017-04-06 at 17:39 +0200, Christoph Hellwig wrote:
> [ ... ]

Reviewed-by: Bart Van Assche 

Re: [PATCH 02/25] block: remove the blk_execute_rq return value

2017-04-13 Thread Bart Van Assche
On Thu, 2017-04-06 at 17:39 +0200, Christoph Hellwig wrote:
> diff --git a/fs/nfsd/blocklayout.c b/fs/nfsd/blocklayout.c
> index 92b4b41d19d2..4b72fdf67548 100644
> --- a/fs/nfsd/blocklayout.c
> +++ b/fs/nfsd/blocklayout.c
> @@ -242,8 +242,8 @@ static int nfsd4_scsi_identify_device(struct block_device 
> *bdev,
>   req->cmd[4] = bufflen & 0xff;
>   req->cmd_len = COMMAND_SIZE(INQUIRY);
>  
> - error = blk_execute_rq(rq->q, NULL, rq, 1);
> - if (error) {
> + blk_execute_rq(rq->q, NULL, rq, 1);
> + if (rq->errors) {
>   pr_err("pNFS: INQUIRY 0x83 failed with: %x\n",
>   rq->errors);
>   goto out_put_request;

Hello Christoph,

That blk_execute_rq() call can only be reached if a few lines above 0 was
assigned to the "error" variable. Since nfsd4_scsi_identify_device() returns
the value of the "error" variable I think -EIO should be assigned to that
variable before the "goto out_put_request" statement is reached.

Bart.

Re: [PATCH 01/25] remove the mg_disk driver

2017-04-13 Thread Bart Van Assche
On Thu, 2017-04-06 at 17:39 +0200, Christoph Hellwig wrote:
> This drivers was added in 2008, but as far as a I can tell we never had a
> single platform that actually registered resources for the platform driver.
> 
> It's also been unmaintained for a long time and apparently has a ATA mode
> that can be driven using the IDE/libata subsystem.

Hello Christoph,

Should the person who submitted this driver be CC-ed for this patch (unsik
Kim )?

Bart.

Re: [PATCH] lightnvm: fix some WARN() messages

2017-04-13 Thread Matias Bjørling



On 04/13/2017 09:36 PM, Dan Carpenter wrote:

WARN_ON() takes a condition, not an error message.  I slightly tweaked
some conditions so hopefully it's more clear.

Signed-off-by: Dan Carpenter 

diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index eff0982c076f..bce7ed5fc73f 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -49,8 +49,8 @@ static void pblk_read_ppalist_rq(struct pblk *pblk, struct 
nvm_rq *rqd,
int i, j = 0;

/* logic error: lba out-of-bounds. Ignore read request */
-   if (!(blba + nr_secs < pblk->rl.nr_secs)) {
-   WARN_ON("pblk: read lbas out of bounds\n");
+   if (blba + nr_secs >= pblk->rl.nr_secs) {
+   WARN(1, "pblk: read lbas out of bounds\n");
return;
}

@@ -254,8 +254,8 @@ static void pblk_read_rq(struct pblk *pblk, struct nvm_rq 
*rqd,
sector_t lba = pblk_get_lba(bio);

/* logic error: lba out-of-bounds. Ignore read request */
-   if (!(lba < pblk->rl.nr_secs)) {
-   WARN_ON("pblk: read lba out of bounds\n");
+   if (lba >= pblk->rl.nr_secs) {
+   WARN(1, "pblk: read lba out of bounds\n");
return;
}

@@ -411,8 +411,8 @@ static int read_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
int valid_secs = 0;

/* logic error: lba out-of-bounds */
-   if (!(lba < pblk->rl.nr_secs)) {
-   WARN_ON("pblk: read lba out of bounds\n");
+   if (lba >= pblk->rl.nr_secs) {
+   WARN(1, "pblk: read lba out of bounds\n");
goto out;
}

diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index 02c415b957d4..56547ca87926 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -141,7 +141,7 @@ static void pblk_end_w_fail(struct pblk *pblk, struct 
nvm_rq *rqd)

/* Logic error */
if (bit > c_ctx->nr_valid) {
-   WARN_ON_ONCE("pblk: corrupted write request\n");
+   WARN_ONCE(1, "pblk: corrupted write request\n");
goto out;
}

diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 0d50f415cfde..f8f85087cd3c 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -167,7 +167,7 @@ static int pblk_recov_l2p_from_emeta(struct pblk *pblk, 
struct pblk_line *line)
if (le64_to_cpu(lba_list[i]) == ADDR_EMPTY) {
spin_lock(>lock);
if (test_and_set_bit(i, line->invalid_bitmap))
-   WARN_ON_ONCE("pblk: rec. double invalidate:\n");
+   WARN_ONCE(1, "pblk: rec. double invalidate:\n");
else
line->vsc--;
spin_unlock(>lock);



Hehe, yep. That was slightly misused. Thanks for pointing it out. I've 
applied it for 4.12.


Re: [PATCH] lightnvm: fix some error code in pblk-init.c

2017-04-13 Thread Matias Bjørling



On 04/13/2017 09:35 PM, Dan Carpenter wrote:

There were a bunch of places in pblk_lines_init() where we didn't set an
error code.  And in pblk_writer_init() we accidentally return 1 instead
of a correct error code, which would result in a Oops later.

Fixes: 11a5d6fdf919 ("lightnvm: physical block device (pblk) target")
Signed-off-by: Dan Carpenter 

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 94653b1f1300..3996e4b8fb0e 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -543,7 +543,7 @@ static int pblk_lines_init(struct pblk *pblk)
long nr_bad_blks, nr_meta_blks, nr_free_blks;
int bb_distance;
int i;
-   int ret = 0;
+   int ret;

lm->sec_per_line = geo->sec_per_blk * geo->nr_luns;
lm->blk_per_line = geo->nr_luns;
@@ -638,12 +638,16 @@ static int pblk_lines_init(struct pblk *pblk)
}

l_mg->bb_template = kzalloc(lm->sec_bitmap_len, GFP_KERNEL);
-   if (!l_mg->bb_template)
+   if (!l_mg->bb_template) {
+   ret = -ENOMEM;
goto fail_free_meta;
+   }

l_mg->bb_aux = kzalloc(lm->sec_bitmap_len, GFP_KERNEL);
-   if (!l_mg->bb_aux)
+   if (!l_mg->bb_aux) {
+   ret = -ENOMEM;
goto fail_free_bb_template;
+   }

bb_distance = (geo->nr_luns) * geo->sec_per_pl;
for (i = 0; i < lm->sec_per_line; i += bb_distance)
@@ -667,8 +671,10 @@ static int pblk_lines_init(struct pblk *pblk)

pblk->lines = kcalloc(l_mg->nr_lines, sizeof(struct pblk_line),
GFP_KERNEL);
-   if (!pblk->lines)
+   if (!pblk->lines) {
+   ret = -ENOMEM;
goto fail_free_bb_aux;
+   }

nr_free_blks = 0;
for (i = 0; i < l_mg->nr_lines; i++) {
@@ -682,8 +688,10 @@ static int pblk_lines_init(struct pblk *pblk)
spin_lock_init(>lock);

nr_bad_blks = pblk_bb_line(pblk, line);
-   if (nr_bad_blks < 0 || nr_bad_blks > lm->blk_per_line)
+   if (nr_bad_blks < 0 || nr_bad_blks > lm->blk_per_line) {
+   ret = -EINVAL;
goto fail_free_lines;
+   }

line->blk_in_line = lm->blk_per_line - nr_bad_blks;
if (line->blk_in_line < lm->min_blk_line) {
@@ -733,7 +741,7 @@ static int pblk_writer_init(struct pblk *pblk)
pblk->writer_ts = kthread_create(pblk_write_ts, pblk, "pblk-writer-t");
if (IS_ERR(pblk->writer_ts)) {
pr_err("pblk: could not allocate writer kthread\n");
-   return 1;
+   return PTR_ERR(pblk->writer_ts);
}

return 0;



Thanks Dan. Applied for 4.12.


[PATCH block-tree] net: off by one in inet6_pton()

2017-04-13 Thread Dan Carpenter
If "scope_len" is sizeof(scope_id) then we would put the NUL terminator
one space beyond the end of the buffer.

Fixes: b1a951fe469e ("net/utils: generic inet_pton_with_scope helper")
Signed-off-by: Dan Carpenter 
---
This one goes through Jens' tree not through net-dev.

diff --git a/net/core/utils.c b/net/core/utils.c
index da1089ea5389..93066bd0305a 100644
--- a/net/core/utils.c
+++ b/net/core/utils.c
@@ -339,7 +339,7 @@ static int inet6_pton(struct net *net, const char *src, u16 
port_num,
src + srclen != scope_delim && *scope_delim == '%') {
struct net_device *dev;
char scope_id[16];
-   size_t scope_len = min_t(size_t, sizeof(scope_id),
+   size_t scope_len = min_t(size_t, sizeof(scope_id) - 1,
 src + srclen - scope_delim - 1);
 
memcpy(scope_id, scope_delim + 1, scope_len);


[PATCH] lightnvm: fix some WARN() messages

2017-04-13 Thread Dan Carpenter
WARN_ON() takes a condition, not an error message.  I slightly tweaked
some conditions so hopefully it's more clear.

Signed-off-by: Dan Carpenter 

diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index eff0982c076f..bce7ed5fc73f 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -49,8 +49,8 @@ static void pblk_read_ppalist_rq(struct pblk *pblk, struct 
nvm_rq *rqd,
int i, j = 0;
 
/* logic error: lba out-of-bounds. Ignore read request */
-   if (!(blba + nr_secs < pblk->rl.nr_secs)) {
-   WARN_ON("pblk: read lbas out of bounds\n");
+   if (blba + nr_secs >= pblk->rl.nr_secs) {
+   WARN(1, "pblk: read lbas out of bounds\n");
return;
}
 
@@ -254,8 +254,8 @@ static void pblk_read_rq(struct pblk *pblk, struct nvm_rq 
*rqd,
sector_t lba = pblk_get_lba(bio);
 
/* logic error: lba out-of-bounds. Ignore read request */
-   if (!(lba < pblk->rl.nr_secs)) {
-   WARN_ON("pblk: read lba out of bounds\n");
+   if (lba >= pblk->rl.nr_secs) {
+   WARN(1, "pblk: read lba out of bounds\n");
return;
}
 
@@ -411,8 +411,8 @@ static int read_rq_gc(struct pblk *pblk, struct nvm_rq *rqd,
int valid_secs = 0;
 
/* logic error: lba out-of-bounds */
-   if (!(lba < pblk->rl.nr_secs)) {
-   WARN_ON("pblk: read lba out of bounds\n");
+   if (lba >= pblk->rl.nr_secs) {
+   WARN(1, "pblk: read lba out of bounds\n");
goto out;
}
 
diff --git a/drivers/lightnvm/pblk-write.c b/drivers/lightnvm/pblk-write.c
index 02c415b957d4..56547ca87926 100644
--- a/drivers/lightnvm/pblk-write.c
+++ b/drivers/lightnvm/pblk-write.c
@@ -141,7 +141,7 @@ static void pblk_end_w_fail(struct pblk *pblk, struct 
nvm_rq *rqd)
 
/* Logic error */
if (bit > c_ctx->nr_valid) {
-   WARN_ON_ONCE("pblk: corrupted write request\n");
+   WARN_ONCE(1, "pblk: corrupted write request\n");
goto out;
}
 
diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 0d50f415cfde..f8f85087cd3c 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -167,7 +167,7 @@ static int pblk_recov_l2p_from_emeta(struct pblk *pblk, 
struct pblk_line *line)
if (le64_to_cpu(lba_list[i]) == ADDR_EMPTY) {
spin_lock(>lock);
if (test_and_set_bit(i, line->invalid_bitmap))
-   WARN_ON_ONCE("pblk: rec. double invalidate:\n");
+   WARN_ONCE(1, "pblk: rec. double invalidate:\n");
else
line->vsc--;
spin_unlock(>lock);


Re: [PATCH] lightnvm: pblk-gc: fix an error pointer dereference in init

2017-04-13 Thread Matias Bjørling

On 04/13/2017 09:32 PM, Dan Carpenter wrote:

These labels are reversed so we could end up dereferencing an error
pointer or leaking.

Fixes: 7f347ba6bb3a ("lightnvm: physical block device (pblk) target")
Signed-off-by: Dan Carpenter 

diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
index 9b147cfd8a41..f173fd4ea947 100644
--- a/drivers/lightnvm/pblk-gc.c
+++ b/drivers/lightnvm/pblk-gc.c
@@ -527,10 +527,10 @@ int pblk_gc_init(struct pblk *pblk)

return 0;

-fail_free_main_kthread:
-   kthread_stop(gc->gc_ts);
 fail_free_writer_kthread:
kthread_stop(gc->gc_writer_ts);
+fail_free_main_kthread:
+   kthread_stop(gc->gc_ts);

return ret;
 }



Thanks Dan. I've applied it for 4.12.


[PATCH] lightnvm: fix some error code in pblk-init.c

2017-04-13 Thread Dan Carpenter
There were a bunch of places in pblk_lines_init() where we didn't set an
error code.  And in pblk_writer_init() we accidentally return 1 instead
of a correct error code, which would result in a Oops later.

Fixes: 11a5d6fdf919 ("lightnvm: physical block device (pblk) target")
Signed-off-by: Dan Carpenter 

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 94653b1f1300..3996e4b8fb0e 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -543,7 +543,7 @@ static int pblk_lines_init(struct pblk *pblk)
long nr_bad_blks, nr_meta_blks, nr_free_blks;
int bb_distance;
int i;
-   int ret = 0;
+   int ret;
 
lm->sec_per_line = geo->sec_per_blk * geo->nr_luns;
lm->blk_per_line = geo->nr_luns;
@@ -638,12 +638,16 @@ static int pblk_lines_init(struct pblk *pblk)
}
 
l_mg->bb_template = kzalloc(lm->sec_bitmap_len, GFP_KERNEL);
-   if (!l_mg->bb_template)
+   if (!l_mg->bb_template) {
+   ret = -ENOMEM;
goto fail_free_meta;
+   }
 
l_mg->bb_aux = kzalloc(lm->sec_bitmap_len, GFP_KERNEL);
-   if (!l_mg->bb_aux)
+   if (!l_mg->bb_aux) {
+   ret = -ENOMEM;
goto fail_free_bb_template;
+   }
 
bb_distance = (geo->nr_luns) * geo->sec_per_pl;
for (i = 0; i < lm->sec_per_line; i += bb_distance)
@@ -667,8 +671,10 @@ static int pblk_lines_init(struct pblk *pblk)
 
pblk->lines = kcalloc(l_mg->nr_lines, sizeof(struct pblk_line),
GFP_KERNEL);
-   if (!pblk->lines)
+   if (!pblk->lines) {
+   ret = -ENOMEM;
goto fail_free_bb_aux;
+   }
 
nr_free_blks = 0;
for (i = 0; i < l_mg->nr_lines; i++) {
@@ -682,8 +688,10 @@ static int pblk_lines_init(struct pblk *pblk)
spin_lock_init(>lock);
 
nr_bad_blks = pblk_bb_line(pblk, line);
-   if (nr_bad_blks < 0 || nr_bad_blks > lm->blk_per_line)
+   if (nr_bad_blks < 0 || nr_bad_blks > lm->blk_per_line) {
+   ret = -EINVAL;
goto fail_free_lines;
+   }
 
line->blk_in_line = lm->blk_per_line - nr_bad_blks;
if (line->blk_in_line < lm->min_blk_line) {
@@ -733,7 +741,7 @@ static int pblk_writer_init(struct pblk *pblk)
pblk->writer_ts = kthread_create(pblk_write_ts, pblk, "pblk-writer-t");
if (IS_ERR(pblk->writer_ts)) {
pr_err("pblk: could not allocate writer kthread\n");
-   return 1;
+   return PTR_ERR(pblk->writer_ts);
}
 
return 0;


[PATCH] lightnvm: pblk-gc: fix an error pointer dereference in init

2017-04-13 Thread Dan Carpenter
These labels are reversed so we could end up dereferencing an error
pointer or leaking.

Fixes: 7f347ba6bb3a ("lightnvm: physical block device (pblk) target")
Signed-off-by: Dan Carpenter 

diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
index 9b147cfd8a41..f173fd4ea947 100644
--- a/drivers/lightnvm/pblk-gc.c
+++ b/drivers/lightnvm/pblk-gc.c
@@ -527,10 +527,10 @@ int pblk_gc_init(struct pblk *pblk)
 
return 0;
 
-fail_free_main_kthread:
-   kthread_stop(gc->gc_ts);
 fail_free_writer_kthread:
kthread_stop(gc->gc_writer_ts);
+fail_free_main_kthread:
+   kthread_stop(gc->gc_ts);
 
return ret;
 }


Re: [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically

2017-04-13 Thread Bart Van Assche
On 04/12/17 19:20, Ming Lei wrote:
> On Wed, Apr 12, 2017 at 06:38:07PM +, Bart Van Assche wrote:
>> If the blk-mq core would always rerun a hardware queue if a block driver
>> returns BLK_MQ_RQ_QUEUE_BUSY then that would cause 100% of a single CPU core
> 
> It won't casue 100% CPU utilization since we restart queue in completion
> path and at that time at least one tag is available, then progress can be
> made.

Hello Ming,

Sorry but you are wrong. If .queue_rq() returns BLK_MQ_RQ_QUEUE_BUSY
then it's likely that calling .queue_rq() again after only a few
microseconds will cause it to return BLK_MQ_RQ_QUEUE_BUSY again. If you
don't believe me, change "if (!blk_mq_sched_needs_restart(hctx) &&
!test_bit(BLK_MQ_S_TAG_WAITING, >state)) blk_mq_run_hw_queue(hctx,
true);" into "blk_mq_run_hw_queue(hctx, true);", trigger a busy
condition for either a SCSI LLD or a dm-rq driver, run top and you will
see that the CPU usage of a kworker thread jumps up to 100%.

Bart.


Re: [PATCH] block: fix bio_will_gap()

2017-04-13 Thread Bart Van Assche
On Fri, 2017-04-14 at 00:06 +0800, Ming Lei wrote:
> But if one rq starts with non-aligned buffer(the 1st bvec's
> bv_offset isn't zero) and if we allow to merge, it is quite
> difficult to respect sg gap limit, especially the segment
> can't be at maximum segment size, otherwise the segment
> ends in unaligned virt boundary. This patch trys to avoid the
> issue by not allowing to merge if the req starts with non-aligned
> buffer.

Hello Ming,

Why is it considered difficult to detect whether or not a gap exists if
the offset of the first bio is non-zero? Please note that a thoroughly
tested version of gap detection code that supports non-zero offsets for
the first element is already upstream. See also ib_map_mr_sg() and
ib_sg_to_pages() in drivers/infiniband/core/verbs.c.

Thanks,

Bart.

[PATCH] block: fix bio_will_gap()

2017-04-13 Thread Ming Lei
Now commit 729204ef49ec("block: relax check on sg gap")
allows us to merge bios if both are physically contiguous,
this change can merge huge of small bios in use case of mkfs,
for example, mkfs.ntfs running time can be decreased to ~1/10.

But if one rq starts with non-aligned buffer(the 1st bvec's
bv_offset isn't zero) and if we allow to merge, it is quite
difficult to respect sg gap limit, especially the segment
can't be at maximum segment size, otherwise the segment
ends in unaligned virt boundary. This patch trys to avoid the
issue by not allowing to merge if the req starts with non-aligned
buffer.

Also add comments to explain why the merged segment can't
end in unaligned virt boundary.

Fixes: 729204ef49ec ("block: relax check on sg gap")
Tested-by: Johannes Thumshirn 
Reviewed-by: Johannes Thumshirn 
Signed-off-by: Ming Lei 
---
 include/linux/blkdev.h | 32 
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b75d6fe5a1b9..5eaf323de98b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1672,12 +1672,36 @@ static inline bool bios_segs_mergeable(struct 
request_queue *q,
return true;
 }
 
-static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
-struct bio *next)
+static inline bool bio_will_gap(struct request_queue *q,
+   struct request *prev_rq,
+   struct bio *prev,
+   struct bio *next)
 {
if (bio_has_data(prev) && queue_virt_boundary(q)) {
struct bio_vec pb, nb;
 
+   /*
+* don't merge if the 1st bio starts with non-zero
+* offset, otherwise it is quite difficult to respect
+* sg gap limit. We work hard to merge huge of small
+* single bios in case of mkfs.
+*/
+   if (prev_rq)
+   bio_get_first_bvec(prev_rq->bio, );
+   else
+   bio_get_first_bvec(prev, );
+   if (pb.bv_offset)
+   return true;
+
+   /*
+* We don't need to worry about the situation that the
+* merged segment ends in unaligned virt boundary:
+*
+* - if 'pb' ends aligned, the merged segment ends aligned
+* - if 'pb' ends unaligned, the next bio must include
+*   one single bvec of 'nb', otherwise the 'nb' can't
+*   merge with 'pb'
+*/
bio_get_last_bvec(prev, );
bio_get_first_bvec(next, );
 
@@ -1690,12 +1714,12 @@ static inline bool bio_will_gap(struct request_queue 
*q, struct bio *prev,
 
 static inline bool req_gap_back_merge(struct request *req, struct bio *bio)
 {
-   return bio_will_gap(req->q, req->biotail, bio);
+   return bio_will_gap(req->q, req, req->biotail, bio);
 }
 
 static inline bool req_gap_front_merge(struct request *req, struct bio *bio)
 {
-   return bio_will_gap(req->q, bio, req->bio);
+   return bio_will_gap(req->q, NULL, bio, req->bio);
 }
 
 static inline void blk_dump_rq(const struct request *req, const char *msg)
-- 
2.9.3



Re: [PATCH] block: bios with an offset are always gappy

2017-04-13 Thread Johannes Thumshirn
On Thu, Apr 13, 2017 at 10:45:10PM +0800, Ming Lei wrote:
> On Thu, Apr 13, 2017 at 01:53:28PM +0200, Johannes Thumshirn wrote:
> > On Thu, Apr 13, 2017 at 06:02:21PM +0800, Ming Lei wrote:
> > > On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote:
> > > > Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic
> > > > in nvme_setup_prps() because the dma_len will drop below zero but the
> > > > length not.
> > > 
> > > Looks I can't reproduce the issue in QEMU(32G nvme, either partitioned
> > > or not, just use 'mkfs.btrfs /dev/nvme0n1p1'), could you share the exact
> > > mkfs command line and size of your emulated NVMe?
> > 
> > the exact cmdline is mkfs.btrfs -f /dev/nvme0n1p1 (-f because there was a
> > existing btrfs on the image). The image is 17179869184 (a.k.a 16G) bytes.
> > 
> > [...]
> > 
> > > Could you try the following patch to see if it fixes your issue?
> > 
> > It's back to the old, erratic behaviour, see log below.
> 
> Johannes, could you test the following patch?
> 
> Thanks
> Ming

Works, awesome thanks!

Tested-by: Johannes Thumshirn 
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH] block: bios with an offset are always gappy

2017-04-13 Thread Ming Lei
On Thu, Apr 13, 2017 at 01:53:28PM +0200, Johannes Thumshirn wrote:
> On Thu, Apr 13, 2017 at 06:02:21PM +0800, Ming Lei wrote:
> > On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote:
> > > Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic
> > > in nvme_setup_prps() because the dma_len will drop below zero but the
> > > length not.
> > 
> > Looks I can't reproduce the issue in QEMU(32G nvme, either partitioned
> > or not, just use 'mkfs.btrfs /dev/nvme0n1p1'), could you share the exact
> > mkfs command line and size of your emulated NVMe?
> 
> the exact cmdline is mkfs.btrfs -f /dev/nvme0n1p1 (-f because there was a
> existing btrfs on the image). The image is 17179869184 (a.k.a 16G) bytes.
> 
> [...]
> 
> > Could you try the following patch to see if it fixes your issue?
> 
> It's back to the old, erratic behaviour, see log below.

Johannes, could you test the following patch?

Thanks
Ming

---

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b75d6fe5a1b9..cc68dfaef528 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1672,12 +1672,27 @@ static inline bool bios_segs_mergeable(struct 
request_queue *q,
return true;
 }
 
-static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
-struct bio *next)
+static inline bool bio_will_gap(struct request_queue *q,
+   struct request *prev_rq,
+   struct bio *prev,
+   struct bio *next)
 {
if (bio_has_data(prev) && queue_virt_boundary(q)) {
struct bio_vec pb, nb;
 
+   /*
+* don't merge if the 1st bio starts with non-zero
+* offset, otherwise it is quite difficult to respect
+* sg gap limit. We work hard to merge huge of small
+* bios in case of mkfs.
+*/
+   if (prev_rq)
+   bio_get_first_bvec(prev_rq->bio, );
+   else
+   bio_get_first_bvec(prev, );
+   if (pb.bv_offset)
+   return true;
+
bio_get_last_bvec(prev, );
bio_get_first_bvec(next, );
 
@@ -1690,12 +1705,12 @@ static inline bool bio_will_gap(struct request_queue 
*q, struct bio *prev,
 
 static inline bool req_gap_back_merge(struct request *req, struct bio *bio)
 {
-   return bio_will_gap(req->q, req->biotail, bio);
+   return bio_will_gap(req->q, req, req->biotail, bio);
 }
 
 static inline bool req_gap_front_merge(struct request *req, struct bio *bio)
 {
-   return bio_will_gap(req->q, bio, req->bio);
+   return bio_will_gap(req->q, NULL, bio, req->bio);
 }
 
 static inline void blk_dump_rq(const struct request *req, const char *msg)


Re: [PATCH] block: bios with an offset are always gappy

2017-04-13 Thread Ming Lei
On Thu, Apr 13, 2017 at 02:20:10PM +0200, Johannes Thumshirn wrote:
> On Thu, Apr 13, 2017 at 08:11:40PM +0800, Ming Lei wrote:
> > Ok, could you apply the attached debug patch and collect the
> > ftrace log? (ftrace_dump_on_oops need to be passed to kernel cmd line).
>

Thanks Johannes very much for collecting the log.

The root cause should be the following:

- if 'offset' of one segment isn't zero, we can't make that segment
  at full size, otherwise the segment ends in the unaligned
  virt boundary.

Give me a while, I will figure out one patch to address the issue.


Thanks,
Ming


Re: Outstanding MQ questions from MMC

2017-04-13 Thread Linus Walleij
On Fri, Mar 31, 2017 at 10:43 AM, Arnd Bergmann  wrote:
> On Thu, Mar 30, 2017 at 6:39 PM, Ulf Hansson  wrote:
>> On 30 March 2017 at 14:42, Arnd Bergmann  wrote:
>>> On Wed, Mar 29, 2017 at 5:09 AM, Linus Walleij  
>>> wrote:
>
 In MQ, I have simply locked the host on the first request and then I never
 release it. Clearly this does not work. I am uncertain on how to handle 
 this
 and whether MQ has a way to tell us that the queue is empty so we may 
 release
 the host. I toyed with the idea to just set up a timer, but a "queue
 empty" callback
 from the block layer is what would be ideal.
>>>
>>> Would it be possible to change the userspace code to go through
>>> the block layer instead and queue a request there, to avoid having
>>> to lock the card at all?
>>
>> That would be good from an I/O scheduling point of view, as it would
>> avoid one side being able to starve the other.
>>
>> However, we would still need a lock, as we also have card detect work
>> queue, which also needs to claim the host when it polls for removable
>> cards.
>
> Hmm, In theory the card-detect work queue should not be active
> at the same time as any I/O, but I see you point. Could the card-detect
> wq perhaps also use the blk queue for sending a special request that
> triggers the detection logic?
>
> Alternatively, I had this idea that we could translate blk requests into
> mmc commands and then have a (short fixed length) set of outstanding
> mmc commands in the device that always get done in order. The card
> detect and the user space I/O would then directly put mmc commands
> onto the command queue, as would the blk-mq scheduler. You
> still need a lock to access that command queue, but the mmc host
> would just always pick the next command off the list when one
> command completes.

I looked into this.

The block layer queue can wrap and handle custom device commands
using REQ_OP_DRV_IN/OUT, and that seems to be the best way
to play with the block layer IMO.

The card detect work is a special case because it is also used by
SDIO which does not use the block layer. But that could maybe be
solved by a separate host lock just for the SDIO case, letting
devices accessed as block devices use the method of inserting
custom commands.

I looked at how e.g. IDE and SCSI does this, drivers/ide/ide-ioctls.c
looks like this nowadays:

static int generic_drive_reset(ide_drive_t *drive)
{
struct request *rq;
int ret = 0;

rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, __GFP_RECLAIM);
scsi_req_init(rq);
ide_req(rq)->type = ATA_PRIV_MISC;
scsi_req(rq)->cmd_len = 1;
scsi_req(rq)->cmd[0] = REQ_DRIVE_RESET;
if (blk_execute_rq(drive->queue, NULL, rq, 1))
ret = rq->errors;
blk_put_request(rq);
return ret;
}

So it creates a custom REQ_OP_DRV_IN request, then scsi_req_init()
sets up the special command, in this case
ATA_PRIV_MISC/REQ_DRIVE_RESET and toss this into the block
queue like everything else.

We could do the same, especially the RPMB operations should
probably have been done like this from the beginning. But due to
historical factors they were not.

It is a bit hairy and the whole thing is in a bit of flux because Christoph
is heavily refactoring this and cleaning up the old block devices as
we speak (I bet) so it is a bit hard to do the right thing.

I easily get confused here ... for example there is custom
per-request data access by this simple:

scsi_req_init(rq)

which does

struct scsi_request *req = scsi_req(rq);

which does

static inline struct scsi_request *scsi_req(struct request *rq)
{
return blk_mq_rq_to_pdu(rq);
}

Oohps blk_mq_* namespace? You would assume this means that
you have to use blk-mq? Nah, I think not, because all it does is:

static inline void *blk_mq_rq_to_pdu(struct request *rq)
{
return rq + 1;
}

So while we have to #include  this is one of these
mixed semantics that just give you a pointer to something behind
the request, a method that is simple and natural in blk-mq but which
is (I guess) set up by some other mechanism in the !mq case,
albeit access by this inline.

And I have to do this with the old block layer to get to a point
where we can start using blk-mq, sigh.

The border between blk and blk-mq is a bit blurry right now.

With blk-mq I do this:

mq->tag_set.cmd_size = sizeof(foo_cmd);
blk_mq_alloc_tag_set(...)

To do this with the old blk layer I may need some help to figure
out how to set up per-request additional data in a way that works
with the old layer.

scsi_lib.c scsi_alloc_queue() does this:

q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE);
if (!q)
  return NULL;
q->cmd_size = sizeof(foo_cmd);

And this means there will be sizeof(foo_cmd) after the request
that can be dereferenced by blk_mq_rq_to_pdu(rq);...

Yeah I'll try it.

Just trying to give a 

Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically

2017-04-13 Thread Benjamin Block
On Wed, Apr 12, 2017 at 06:11:25PM +, Bart Van Assche wrote:
> On Wed, 2017-04-12 at 12:55 +0200, Benjamin Block wrote:
> > On Fri, Apr 07, 2017 at 11:16:48AM -0700, Bart Van Assche wrote:
> > > The six patches in this patch series fix the queue lockup I reported
> > > recently on the linux-block mailing list. Please consider these patches
> > > for inclusion in the upstream kernel.
> >
> > just out of curiosity. Is this maybe related to similar stuff happening
> > when CPUs are hot plugged - at least in that the stack gets stuck? Like
> > in this thread here:
> > https://www.mail-archive.com/linux-block@vger.kernel.org/msg06057.html
> >
> > Would be interesting, because we recently saw similar stuff happening.
>
> Hello Benjamin,
>
> My proposal is to repeat that test with Jens' for-next branch. If the issue
> still occurs with that tree then please check the contents of
> /sys/kernel/debug/block/*/mq/*/{dispatch,*/rq_list}. That will allow to
> determine whether or not any block layer requests are still pending. If
> running the command below resolves the deadlock then it means that a
> trigger to run a block layer queue is still missing somewhere:
>
> for a in /sys/kernel/debug/block/*/mq/state; do echo run >$a; done
>
> See also git://git.kernel.dk/linux-block.git.
>

Thx for the hint! I'll forward that and see if the affected folks are
willing to reproduce.


Beste Grüße / Best regards,
  - Benjamin Block
--
Linux on z Systems Development / IBM Systems & Technology Group
  IBM Deutschland Research & Development GmbH
Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294



Re: [PATCH] block: bios with an offset are always gappy

2017-04-13 Thread Ming Lei
On Thu, Apr 13, 2017 at 01:53:28PM +0200, Johannes Thumshirn wrote:
> On Thu, Apr 13, 2017 at 06:02:21PM +0800, Ming Lei wrote:
> > On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote:
> > > Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic
> > > in nvme_setup_prps() because the dma_len will drop below zero but the
> > > length not.
> > 
> > Looks I can't reproduce the issue in QEMU(32G nvme, either partitioned
> > or not, just use 'mkfs.btrfs /dev/nvme0n1p1'), could you share the exact
> > mkfs command line and size of your emulated NVMe?
> 
> the exact cmdline is mkfs.btrfs -f /dev/nvme0n1p1 (-f because there was a
> existing btrfs on the image). The image is 17179869184 (a.k.a 16G) bytes.
> 
> [...]
> 
> > Could you try the following patch to see if it fixes your issue?
> 
> It's back to the old, erratic behaviour, see log below.

Ok, could you apply the attached debug patch and collect the
ftrace log? (ftrace_dump_on_oops need to be passed to kernel cmd line).


Thanks,
Ming

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 26a5fd05fe88..a813a36d48d9 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -491,6 +491,8 @@ static bool nvme_setup_prps(struct nvme_dev *dev, struct 
request *req)
break;
if (dma_len > 0)
continue;
+   if (dma_len < 0)
+   blk_dump_rq(req, "nvme dma sg gap");
BUG_ON(dma_len < 0);
sg = sg_next(sg);
dma_addr = sg_dma_address(sg);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e521194f6fc..f3b001e401d2 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -811,5 +811,29 @@ static inline int bio_integrity_add_page(struct bio *bio, 
struct page *page,
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+static inline void blk_dump_bio(struct bio *bio, const char *msg)
+{
+   struct bvec_iter iter;
+   struct bio_vec bvec;
+   int i = 0;
+   unsigned sectors = 0;
+
+   trace_printk("%s-%p: %hx/%hx %u %llu %u\n",
+   msg, bio,
+   bio->bi_flags, bio->bi_opf,
+   bio->bi_phys_segments,
+   (unsigned long long)bio->bi_iter.bi_sector,
+   bio->bi_iter.bi_size);
+   bio_for_each_segment(bvec, bio, iter) {
+   sectors += bvec.bv_len >> 9;
+   trace_printk("\t %d: %lu %u %u(%u)\n", i++,
+   (unsigned long)page_to_pfn(bvec.bv_page),
+   bvec.bv_offset,
+   bvec.bv_len, bvec.bv_len >> 12);
+   }
+   trace_printk("\t total sectors %u\n", sectors);
+}
+
+
 #endif /* CONFIG_BLOCK */
 #endif /* __LINUX_BIO_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7548f332121a..b75d6fe5a1b9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1698,6 +1698,22 @@ static inline bool req_gap_front_merge(struct request 
*req, struct bio *bio)
return bio_will_gap(req->q, bio, req->bio);
 }
 
+static inline void blk_dump_rq(const struct request *req, const char *msg)
+{
+   struct bio *bio;
+   int i = 0;
+
+   trace_printk("%s: dump bvec for %p(f:%x, seg: %d)\n",
+   msg, req, req->cmd_flags,
+   req->nr_phys_segments);
+
+   __rq_for_each_bio(bio, req) {
+   char num[16];
+   snprintf(num, 16, "%d", i++);
+   blk_dump_bio(bio, num);
+   }
+}
+
 int kblockd_schedule_work(struct work_struct *work);
 int kblockd_schedule_work_on(int cpu, struct work_struct *work);
 int kblockd_schedule_delayed_work(struct delayed_work *dwork, unsigned long 
delay);


Re: [PATCH] block: bios with an offset are always gappy

2017-04-13 Thread Johannes Thumshirn
On Thu, Apr 13, 2017 at 06:02:21PM +0800, Ming Lei wrote:
> On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote:
> > Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic
> > in nvme_setup_prps() because the dma_len will drop below zero but the
> > length not.
> 
> Looks I can't reproduce the issue in QEMU(32G nvme, either partitioned
> or not, just use 'mkfs.btrfs /dev/nvme0n1p1'), could you share the exact
> mkfs command line and size of your emulated NVMe?

the exact cmdline is mkfs.btrfs -f /dev/nvme0n1p1 (-f because there was a
existing btrfs on the image). The image is 17179869184 (a.k.a 16G) bytes.

[...]

> Could you try the following patch to see if it fixes your issue?

It's back to the old, erratic behaviour, see log below.
> 
> ---
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 7548f332121a..65d1510681c6 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1659,16 +1659,28 @@ static inline bool bvec_gap_to_prev(struct 
> request_queue *q,
>   * and the 1st bvec in the 2nd bio can be handled in one segment.
>   */
>  static inline bool bios_segs_mergeable(struct request_queue *q,
> - struct bio *prev, struct bio_vec *prev_last_bv,
> + struct bio *prev, struct bio *next,
> + struct bio_vec *prev_last_bv,
>   struct bio_vec *next_first_bv)
>  {
>   if (!BIOVEC_PHYS_MERGEABLE(prev_last_bv, next_first_bv))
>   return false;
>   if (!BIOVEC_SEG_BOUNDARY(q, prev_last_bv, next_first_bv))
>   return false;
> - if (prev->bi_seg_back_size + next_first_bv->bv_len >
> + if (prev->bi_seg_back_size + next->bi_seg_front_size >
>   queue_max_segment_size(q))
>   return false;
> +
> + /*
> +  * if 'next' has multiple segments, we need to make
> +  * sure the merged segment from 'pb' and the 1st segment
> +  * of 'next' ends at aligned virt boundary.
> +  */
> + if ((next->bi_seg_front_size < next->bi_iter.bi_size) &&
> + ((prev_last_bv->bv_offset + prev_last_bv->bv_len +
> +  next->bi_seg_front_size) & queue_virt_boundary(q)))
> + return false;
> +
>   return true;
>  }
>  
> @@ -1681,7 +1693,7 @@ static inline bool bio_will_gap(struct request_queue 
> *q, struct bio *prev,
>   bio_get_last_bvec(prev, );
>   bio_get_first_bvec(next, );
>  
> - if (!bios_segs_mergeable(q, prev, , ))
> + if (!bios_segs_mergeable(q, prev, next, , ))
>   return __bvec_gap_to_prev(q, , nb.bv_offset);
>   }



dracut:/# [1.211567] tsc: Refined TSC clocksource calibration: 2297.338 MHz
[1.212601] clocksource: tsc: mask: 0x max_cycles: 
0x211d6274d86, max_idle_ns: 440795243673 ns
dracut:/# modprobe btrfs
[8.139179] raid6_pq: module verification failed: signature and/or required 
key missing - tainting kernel
[8.207509] raid6: sse2x1   gen()  6827 MB/s
[8.275512] raid6: sse2x1   xor()  5654 MB/s
[8.343507] raid6: sse2x2   gen() 11573 MB/s
[8.411503] raid6: sse2x2   xor()  8826 MB/s
[8.479504] raid6: sse2x4   gen() 14794 MB/s
[8.547504] raid6: sse2x4   xor() 10618 MB/s
[8.547830] raid6: using algorithm sse2x4 gen() 14794 MB/s
[8.548218] raid6:  xor() 10618 MB/s, rmw enabled
[8.548558] raid6: using intx1 recovery algorithm
[8.549341] xor: measuring software checksum speed
[8.587533]prefetch64-sse: 15090.000 MB/sec
[8.627553]generic_sse: 13530.000 MB/sec
[8.627945] xor: using function: prefetch64-sse (15090.000 MB/sec)
[8.633795] Btrfs loaded, crc32c=crc32c-generic, assert=on
dracut:/# modprobe nvme
[   12.348762] nvme nvme0: pci function :00:04.0
[   12.386300] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
dracut:/# [   12.391707]  nvme0n1: p1
dracut:/# mkfs.b[   36.553376] random: fast init done
tr
dracut:/# mkfs.btrfs -f /dev/nvme0n1p1
btrfs-progs v4.5.3+20160729
See http://btrfs.wiki.kernel.org for more information.

Detected a SSD, turning off metadata duplication.  Mkfs with -m dup if you want 
to force metadata duplication.
[   46.696671] [ cut here ]
[   46.697338] kernel BUG at drivers/nvme/host/pci.c:494!
[   46.697806] invalid opcode:  [#1] SMP
[   46.698175] Modules linked in: nvme(E) nvme_core(E) btrfs(E) xor(E) 
raid6_pq(E)
[   46.698879] CPU: 1 PID: 18 Comm: kworker/1:0H Tainted: GE   
4.11.0-rc6-default+ #43
[   46.699686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[   46.700737] Workqueue: kblockd blk_mq_run_work_fn
[   46.701169] task: 88007bd24540 task.stack: c93bc000
[   46.701709] RIP: 0010:nvme_queue_rq+0x85d/0x886 [nvme]
[   46.702185] RSP: 0018:c93bfc78 EFLAGS: 00010286
[   46.702670] RAX: 0078 RBX: 1000 RCX: 7f625000
[   

Re: [PATCH] block: bios with an offset are always gappy

2017-04-13 Thread Johannes Thumshirn
On Thu, Apr 13, 2017 at 06:02:21PM +0800, Ming Lei wrote:
> Could you try the following patch to see if it fixes your issue?

Sure, jsut have a short lunch break and then I'll report back.
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH] block: bios with an offset are always gappy

2017-04-13 Thread Ming Lei
On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote:
> Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic
> in nvme_setup_prps() because the dma_len will drop below zero but the
> length not.

Looks I can't reproduce the issue in QEMU(32G nvme, either partitioned
or not, just use 'mkfs.btrfs /dev/nvme0n1p1'), could you share the exact
mkfs command line and size of your emulated NVMe?

> 
> A git bisect tracked the behaviour down to commit 729204ef49ec ("block:
> relax check on sg gap"). Since commit 729204ef49ec a bio's offsets are not
> taken into account in the decision if the bio will gap any more. Restore
> the old behavior of checking bio offsets as well for the decision if a
> bio will gap.
> 
> Signed-off-by: Johannes Thumshirn 
> Fixes: 729204ef49ec ("block: relax check on sg gap")
> Cc: Ming Lei 
> ---
>  include/linux/blkdev.h | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 7548f332121a..a03b7196209e 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1677,11 +1677,14 @@ static inline bool bio_will_gap(struct request_queue 
> *q, struct bio *prev,
>  {
>   if (bio_has_data(prev) && queue_virt_boundary(q)) {
>   struct bio_vec pb, nb;
> + bool offset;
>  
>   bio_get_last_bvec(prev, );
>   bio_get_first_bvec(next, );
>  
> - if (!bios_segs_mergeable(q, prev, , ))
> + offset = pb.bv_offset || nb.bv_offset;

We don't consider pb's offset here, because if pb.bv_offset
isn't zero, pb should be the only bvec in 'prev' and will be
put in a standalone segement, and we still can make 'nb' into
this segment if both are mergeable.

But the following issue might be caused by commit 729204ef49ec
("block: relax check on sg gap"):

- if the 'next' has more than one segment
- the segment merged from 'pb' and the 1st segment of 'next'
may end at un-aligned virt boundary

Could you try the following patch to see if it fixes your issue?

---
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7548f332121a..65d1510681c6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1659,16 +1659,28 @@ static inline bool bvec_gap_to_prev(struct 
request_queue *q,
  * and the 1st bvec in the 2nd bio can be handled in one segment.
  */
 static inline bool bios_segs_mergeable(struct request_queue *q,
-   struct bio *prev, struct bio_vec *prev_last_bv,
+   struct bio *prev, struct bio *next,
+   struct bio_vec *prev_last_bv,
struct bio_vec *next_first_bv)
 {
if (!BIOVEC_PHYS_MERGEABLE(prev_last_bv, next_first_bv))
return false;
if (!BIOVEC_SEG_BOUNDARY(q, prev_last_bv, next_first_bv))
return false;
-   if (prev->bi_seg_back_size + next_first_bv->bv_len >
+   if (prev->bi_seg_back_size + next->bi_seg_front_size >
queue_max_segment_size(q))
return false;
+
+   /*
+* if 'next' has multiple segments, we need to make
+* sure the merged segment from 'pb' and the 1st segment
+* of 'next' ends at aligned virt boundary.
+*/
+   if ((next->bi_seg_front_size < next->bi_iter.bi_size) &&
+   ((prev_last_bv->bv_offset + prev_last_bv->bv_len +
+next->bi_seg_front_size) & queue_virt_boundary(q)))
+   return false;
+
return true;
 }
 
@@ -1681,7 +1693,7 @@ static inline bool bio_will_gap(struct request_queue *q, 
struct bio *prev,
bio_get_last_bvec(prev, );
bio_get_first_bvec(next, );
 
-   if (!bios_segs_mergeable(q, prev, , ))
+   if (!bios_segs_mergeable(q, prev, next, , ))
return __bvec_gap_to_prev(q, , nb.bv_offset);
}
 

-- 
Ming


Re: [PATCH] block: bios with an offset are always gappy

2017-04-13 Thread Johannes Thumshirn
On Thu, Apr 13, 2017 at 11:48:35AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote:
> > Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic
> > in nvme_setup_prps() because the dma_len will drop below zero but the
> > length not.
> 
> I think we should also turns this into a WARN_ON_ONCE + error return..
> 
> But do you have an exact btrfsprogs version and command line?  I do a lot
> of testing that involves mkfs.btrfs on nvme and haven't seen it..

Ah one detail I forgot: mkfs.xfs _does_ work. Haven't checked ext4 though.

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH] block: bios with an offset are always gappy

2017-04-13 Thread Johannes Thumshirn
On Thu, Apr 13, 2017 at 11:48:35AM +0200, Christoph Hellwig wrote:
> I think we should also turns this into a WARN_ON_ONCE + error return..
> 
> But do you have an exact btrfsprogs version and command line?  I do a lot
> of testing that involves mkfs.btrfs on nvme and haven't seen it..

Sure, it's:
mkfs.btrfs, part of btrfs-progs v4.5.3+20160729

Qemu is 2.6.2

[...]

> 
> I think the code in NVMe (and potentially the other drivers using
> virt_queue_boundary) is bogus.  All of them are actually fine with
> gaps in the protocol, as long as the gaps are aligned to said boundary.
> 
> So I suspect what we really need is to fix up NVMe, and after that
> we could even relax the above check, to not check for offset but
> offset & queue_virt_boundary(q).

That's what I tried doing the last two days but as we're rather late in the rc
cycle and it is a regression that came in with -rc1 I'd rather like to have it
fixed or at least have a band aid in place.

Byte,
Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH] block: bios with an offset are always gappy

2017-04-13 Thread Christoph Hellwig
On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote:
> Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic
> in nvme_setup_prps() because the dma_len will drop below zero but the
> length not.

I think we should also turns this into a WARN_ON_ONCE + error return..

But do you have an exact btrfsprogs version and command line?  I do a lot
of testing that involves mkfs.btrfs on nvme and haven't seen it..

> A git bisect tracked the behaviour down to commit 729204ef49ec ("block:
> relax check on sg gap"). Since commit 729204ef49ec a bio's offsets are not
> taken into account in the decision if the bio will gap any more. Restore
> the old behavior of checking bio offsets as well for the decision if a
> bio will gap.
> 
> Signed-off-by: Johannes Thumshirn 
> Fixes: 729204ef49ec ("block: relax check on sg gap")
> Cc: Ming Lei 
> ---
>  include/linux/blkdev.h | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 7548f332121a..a03b7196209e 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1677,11 +1677,14 @@ static inline bool bio_will_gap(struct request_queue 
> *q, struct bio *prev,
>  {
>   if (bio_has_data(prev) && queue_virt_boundary(q)) {
>   struct bio_vec pb, nb;
> + bool offset;
>  
>   bio_get_last_bvec(prev, );
>   bio_get_first_bvec(next, );
>  
> - if (!bios_segs_mergeable(q, prev, , ))
> + offset = pb.bv_offset || nb.bv_offset;
> +
> + if (offset || !bios_segs_mergeable(q, prev, , ))
>   return __bvec_gap_to_prev(q, , nb.bv_offset);
>   }

I think the code in NVMe (and potentially the other drivers using
virt_queue_boundary) is bogus.  All of them are actually fine with
gaps in the protocol, as long as the gaps are aligned to said boundary.

So I suspect what we really need is to fix up NVMe, and after that
we could even relax the above check, to not check for offset but
offset & queue_virt_boundary(q).


Re: [PATCH 2/3] block: remove blk_end_request_cur

2017-04-13 Thread Johannes Thumshirn
On Wed, Apr 12, 2017 at 12:13:58PM +0200, Christoph Hellwig wrote:
> This function is not used anywhere in the kernel.
> 
> Signed-off-by: Christoph Hellwig 
> ---
Looks good, though one could argue it can be merged into the last patch.
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 1/3] block: remove blk_end_request_err and __blk_end_request_err

2017-04-13 Thread Johannes Thumshirn
On Wed, Apr 12, 2017 at 12:13:57PM +0200, Christoph Hellwig wrote:
> Both functions are entirely unused.
> 
> Signed-off-by: Christoph Hellwig 
> ---

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


[PATCH] block: bios with an offset are always gappy

2017-04-13 Thread Johannes Thumshirn
Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic
in nvme_setup_prps() because the dma_len will drop below zero but the
length not.

A git bisect tracked the behaviour down to commit 729204ef49ec ("block:
relax check on sg gap"). Since commit 729204ef49ec a bio's offsets are not
taken into account in the decision if the bio will gap any more. Restore
the old behavior of checking bio offsets as well for the decision if a
bio will gap.

Signed-off-by: Johannes Thumshirn 
Fixes: 729204ef49ec ("block: relax check on sg gap")
Cc: Ming Lei 
---
 include/linux/blkdev.h | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7548f332121a..a03b7196209e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1677,11 +1677,14 @@ static inline bool bio_will_gap(struct request_queue 
*q, struct bio *prev,
 {
if (bio_has_data(prev) && queue_virt_boundary(q)) {
struct bio_vec pb, nb;
+   bool offset;
 
bio_get_last_bvec(prev, );
bio_get_first_bvec(next, );
 
-   if (!bios_segs_mergeable(q, prev, , ))
+   offset = pb.bv_offset || nb.bv_offset;
+
+   if (offset || !bios_segs_mergeable(q, prev, , ))
return __bvec_gap_to_prev(q, , nb.bv_offset);
}
 
-- 
2.12.0