Re: [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation
On Fri, May 18, 2018 at 03:04:01PM +0200, Roman Pen wrote:
> Add IBTRS Makefile, Kconfig and also corresponding lines into upper
> layer infiniband/ulp files.
>
> Signed-off-by: Roman Pen
> Signed-off-by: Danil Kipnis
> Cc: Jack Wang
> ---
>  drivers/infiniband/Kconfig            |  1 +
>  drivers/infiniband/ulp/Makefile       |  1 +
>  drivers/infiniband/ulp/ibtrs/Kconfig  | 20
>  drivers/infiniband/ulp/ibtrs/Makefile | 15 +++
>  4 files changed, 37 insertions(+)
>  create mode 100644 drivers/infiniband/ulp/ibtrs/Kconfig
>  create mode 100644 drivers/infiniband/ulp/ibtrs/Makefile
>
> diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
> index ee270e065ba9..787bd286fb08 100644
> --- a/drivers/infiniband/Kconfig
> +++ b/drivers/infiniband/Kconfig
> @@ -94,6 +94,7 @@ source "drivers/infiniband/ulp/srpt/Kconfig"
>
>  source "drivers/infiniband/ulp/iser/Kconfig"
>  source "drivers/infiniband/ulp/isert/Kconfig"
> +source "drivers/infiniband/ulp/ibtrs/Kconfig"
>
>  source "drivers/infiniband/ulp/opa_vnic/Kconfig"
>  source "drivers/infiniband/sw/rdmavt/Kconfig"
> diff --git a/drivers/infiniband/ulp/Makefile b/drivers/infiniband/ulp/Makefile
> index 437813c7b481..1c4f10dc8d49 100644
> --- a/drivers/infiniband/ulp/Makefile
> +++ b/drivers/infiniband/ulp/Makefile
> @@ -5,3 +5,4 @@ obj-$(CONFIG_INFINIBAND_SRPT) += srpt/
>  obj-$(CONFIG_INFINIBAND_ISER)     += iser/
>  obj-$(CONFIG_INFINIBAND_ISERT)    += isert/
>  obj-$(CONFIG_INFINIBAND_OPA_VNIC) += opa_vnic/
> +obj-$(CONFIG_INFINIBAND_IBTRS)    += ibtrs/
> diff --git a/drivers/infiniband/ulp/ibtrs/Kconfig b/drivers/infiniband/ulp/ibtrs/Kconfig
> new file mode 100644
> index ..eaeb8f3f6b4e
> --- /dev/null
> +++ b/drivers/infiniband/ulp/ibtrs/Kconfig
> @@ -0,0 +1,20 @@
> +config INFINIBAND_IBTRS
> +	tristate
> +	depends on INFINIBAND_ADDR_TRANS
> +
> +config INFINIBAND_IBTRS_CLIENT
> +	tristate "IBTRS client module"
> +	depends on INFINIBAND_ADDR_TRANS
> +	select INFINIBAND_IBTRS
> +	help
> +	  IBTRS client allows for simplified data transfer and connection
> +	  establishment over RDMA (InfiniBand, RoCE, iWarp). Uses BIO-like
> +	  READ/WRITE semantics and provides multipath capabilities.
> +
> +config INFINIBAND_IBTRS_SERVER
> +	tristate "IBTRS server module"
> +	depends on INFINIBAND_ADDR_TRANS
> +	select INFINIBAND_IBTRS
> +	help
> +	  IBTRS server module processing connection and IO requests received
> +	  from the IBTRS client module.
> diff --git a/drivers/infiniband/ulp/ibtrs/Makefile b/drivers/infiniband/ulp/ibtrs/Makefile
> new file mode 100644
> index ..e6ea858745ad
> --- /dev/null
> +++ b/drivers/infiniband/ulp/ibtrs/Makefile
> @@ -0,0 +1,15 @@
> +ibtrs-client-y := ibtrs-clt.o \
> +	ibtrs-clt-stats.o \
> +	ibtrs-clt-sysfs.o
> +
> +ibtrs-server-y := ibtrs-srv.o \
> +	ibtrs-srv-stats.o \
> +	ibtrs-srv-sysfs.o
> +
> +ibtrs-core-y := ibtrs.o
> +
> +obj-$(CONFIG_INFINIBAND_IBTRS)        += ibtrs-core.o

Will it build ibtrs-core in case both server and client are disabled
in .config?

> +obj-$(CONFIG_INFINIBAND_IBTRS_CLIENT) += ibtrs-client.o
> +obj-$(CONFIG_INFINIBAND_IBTRS_SERVER) += ibtrs-server.o
> +
> +-include $(src)/compat/compat.mk

What is this?

> --
> 2.13.1
Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
On May 21, 2018, at 9:47 PM, Ming Lei wrote:
>
>> On Mon, May 21, 2018 at 09:16:33PM -0600, Jens Axboe wrote:
>>> On 5/21/18 8:49 PM, Ming Lei wrote:
>>>> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
>>>>> This patch simplifies the timeout handling by relying on the request
>>>>> reference counting to ensure the iterator is operating on an inflight
>>>>
>>>> The reference counting isn't free, what is the real benefit in this way?
>>>
>>> Neither is the current scheme and locking, and this is a hell of a lot
>>> simpler. I'd get rid of the kref stuff and just do a simple
>>> atomic_dec_and_test(). Most of the time we should be uncontended on
>>> that, which would make it pretty darn cheap. I'd be surprised if it
>>> wasn't better than the current alternatives.
>>
>> The explicit memory barriers implied by atomic_dec_and_test() aren't free.

I'm not saying it's free. Neither is our current synchronization.

>> Also the double completion issue needs to be fixed before discussing
>> this approach further.

Certainly. Also not saying that the current patch is perfect. But it's a
lot more palatable than the alternatives, hence my interest. And I'd like
for this issue to get solved, we seem to be a bit stuck atm.

I'll take a closer look tomorrow.
Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
On Mon, May 21, 2018 at 09:16:33PM -0600, Jens Axboe wrote:
> On 5/21/18 8:49 PM, Ming Lei wrote:
> > On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
> > > This patch simplifies the timeout handling by relying on the request
> > > reference counting to ensure the iterator is operating on an inflight
> >
> > The reference counting isn't free, what is the real benefit in this way?
>
> Neither is the current scheme and locking, and this is a hell of a lot
> simpler. I'd get rid of the kref stuff and just do a simple
> atomic_dec_and_test(). Most of the time we should be uncontended on
> that, which would make it pretty darn cheap. I'd be surprised if it
> wasn't better than the current alternatives.

The explicit memory barriers implied by atomic_dec_and_test() aren't free.

Also the double completion issue needs to be fixed before discussing
this approach further.

Thanks,
Ming
Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
On 5/21/18 8:49 PM, Ming Lei wrote:
> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
>> This patch simplifies the timeout handling by relying on the request
>> reference counting to ensure the iterator is operating on an inflight
>
> The reference counting isn't free, what is the real benefit in this way?

Neither is the current scheme and locking, and this is a hell of a lot
simpler. I'd get rid of the kref stuff and just do a simple
atomic_dec_and_test(). Most of the time we should be uncontended on
that, which would make it pretty darn cheap. I'd be surprised if it
wasn't better than the current alternatives.

--
Jens Axboe
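Sketched out, the atomic_dec_and_test() variant Jens describes might look
something like this — an untested illustration only; the ref_count field
and the helper names are invented here, not taken from the posted series:

	static void blk_mq_put_request_ref(struct request *rq)
	{
		/*
		 * atomic_dec_and_test() implies a full barrier; the final
		 * put is the only place the tag may be released for reuse.
		 */
		if (atomic_dec_and_test(&rq->ref_count))
			__blk_mq_free_request(rq);
	}

	static bool blk_mq_get_request_ref(struct request *rq)
	{
		/*
		 * Timeout path: claim the request only if it is still
		 * alive; a request racing with its final put is skipped.
		 */
		return atomic_inc_not_zero(&rq->ref_count);
	}

The uncontended case is then a single atomic op per request on the
completion side, which is the "pretty darn cheap" part of the argument
above.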
Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
> This patch simplifies the timeout handling by relying on the request
> reference counting to ensure the iterator is operating on an inflight

The reference counting isn't free, what is the real benefit in this way?

> and truly timed out request. Since the reference counting prevents the
> tag from being reallocated, the block layer no longer needs to prevent
> drivers from completing their requests while the timeout handler is
> operating on it: a driver completing a request is allowed to proceed to
> the next state without additional synchronization with the block layer.

This might cause trouble for drivers, since the previous behaviour is
that one request is only completed from one path, and now you change
the behaviour.

>
> This also removes any need for generation sequence numbers since the
> request lifetime is prevented from being reallocated as a new sequence
> while timeout handling is operating on it.
>
> Signed-off-by: Keith Busch
> ---
>  block/blk-core.c       |   6 --
>  block/blk-mq-debugfs.c |   1 -
>  block/blk-mq.c         | 259 ++---
>  block/blk-mq.h         |  20 +---
>  block/blk-timeout.c    |   1 -
>  include/linux/blkdev.h |  26 +
>  6 files changed, 58 insertions(+), 255 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 43370faee935..cee03cad99f2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
>  	rq->internal_tag = -1;
>  	rq->start_time_ns = ktime_get_ns();
>  	rq->part = NULL;
> -	seqcount_init(&rq->gstate_seq);
> -	u64_stats_init(&rq->aborted_gstate_sync);
> -	/*
> -	 * See comment of blk_mq_init_request
> -	 */
> -	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
>  }
>  EXPORT_SYMBOL(blk_rq_init);
>
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 3080e18cb859..ffa622366922 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -344,7 +344,6 @@ static const char *const rqf_name[] = {
>  	RQF_NAME(STATS),
>  	RQF_NAME(SPECIAL_PAYLOAD),
>  	RQF_NAME(ZONE_WRITE_LOCKED),
> -	RQF_NAME(MQ_TIMEOUT_EXPIRED),
>  	RQF_NAME(MQ_POLL_SLEPT),
>  };
>  #undef RQF_NAME
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 66e5c768803f..4858876fd364 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -589,56 +589,6 @@ static void __blk_mq_complete_request(struct request *rq)
>  	put_cpu();
>  }
>
> -static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
> -	__releases(hctx->srcu)
> -{
> -	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
> -		rcu_read_unlock();
> -	else
> -		srcu_read_unlock(hctx->srcu, srcu_idx);
> -}
> -
> -static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
> -	__acquires(hctx->srcu)
> -{
> -	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
> -		/* shut up gcc false positive */
> -		*srcu_idx = 0;
> -		rcu_read_lock();
> -	} else
> -		*srcu_idx = srcu_read_lock(hctx->srcu);
> -}
> -
> -static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
> -{
> -	unsigned long flags;
> -
> -	/*
> -	 * blk_mq_rq_aborted_gstate() is used from the completion path and
> -	 * can thus be called from irq context. u64_stats_fetch in the
> -	 * middle of update on the same CPU leads to lockup. Disable irq
> -	 * while updating.
> -	 */
> -	local_irq_save(flags);
> -	u64_stats_update_begin(&rq->aborted_gstate_sync);
> -	rq->aborted_gstate = gstate;
> -	u64_stats_update_end(&rq->aborted_gstate_sync);
> -	local_irq_restore(flags);
> -}
> -
> -static u64 blk_mq_rq_aborted_gstate(struct request *rq)
> -{
> -	unsigned int start;
> -	u64 aborted_gstate;
> -
> -	do {
> -		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
> -		aborted_gstate = rq->aborted_gstate;
> -	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
> -
> -	return aborted_gstate;
> -}
> -
>  /**
>   * blk_mq_complete_request - end I/O on a request
>   * @rq: the request being processed
> @@ -650,27 +600,10 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
>  void blk_mq_complete_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
> -	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> -	int srcu_idx;
>
>  	if (unlikely(blk_should_fake_timeout(q)))
>  		return;
> -
> -	/*
> -	 * If @rq->aborted_gstate equals the current instance, timeout is
> -	 * claiming @rq and we lost. This is synchronized through
> -	 * hctx_lock(). See blk_mq_timeout_work() for details.
> -	 *
> -	 * Completion path never blocks and we can directly use RCU here
> -	 * instead of
Re: [RFC PATCH 2/3] blk-mq: Fix timeout and state order
On Mon, May 21, 2018 at 05:11:30PM -0600, Keith Busch wrote:
> The block layer had been setting the state to in-flight prior to updating
> the timer. This is the wrong order since the timeout handler could observe
> the in-flight state with the older timeout, believing the request had
> expired when in fact it is just getting started.
>
> Signed-off-by: Keith Busch
> ---
>  block/blk-mq.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8b370ed75605..66e5c768803f 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -713,8 +713,8 @@ void blk_mq_start_request(struct request *rq)
>  	preempt_disable();
>  	write_seqcount_begin(&rq->gstate_seq);
>
> -	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>  	blk_add_timer(rq);
> +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>
>  	write_seqcount_end(&rq->gstate_seq);
>  	preempt_enable();
> --
> 2.14.3
>

Looks fine,

Reviewed-by: Ming Lei

Thanks,
Ming
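To spell the race out (a schematic added for clarity, not taken from the
thread): with the old order, the timeout worker can observe a request that
is in flight but still carries the deadline from its previous incarnation:

	blk_mq_start_request()           timeout worker
	----------------------           --------------
	state = MQ_RQ_IN_FLIGHT
	                                 sees IN_FLIGHT
	                                 reads stale deadline -> "expired"
	blk_add_timer(rq)  /* too late */

Arming the timer first means any request observed as in-flight already
carries a fresh deadline.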
Re: [RFC PATCH 1/3] blk-mq: Reference count request usage
On Mon, May 21, 2018 at 05:11:29PM -0600, Keith Busch wrote:
> This patch adds a struct kref to struct request so that request users
> can be sure they're operating on the same request without it changing
> while they're processing it. The request's tag won't be released for
> reuse until the last user is done with it.
>
> Signed-off-by: Keith Busch
> ---
>  block/blk-mq.c         | 30 +++---
>  include/linux/blkdev.h |  2 ++
>  2 files changed, 25 insertions(+), 7 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4cbfd784e837..8b370ed75605 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -332,6 +332,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>  #endif
>
>  	data->ctx->rq_dispatched[op_is_sync(op)]++;
> +	kref_init(&rq->ref);
>  	return rq;
>  }
>
> @@ -465,13 +466,33 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
>
> +static void blk_mq_exit_request(struct kref *ref)
> +{
> +	struct request *rq = container_of(ref, struct request, ref);
> +	struct request_queue *q = rq->q;
> +	struct blk_mq_ctx *ctx = rq->mq_ctx;
> +	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> +	const int sched_tag = rq->internal_tag;
> +
> +	if (rq->tag != -1)
> +		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
> +	if (sched_tag != -1)
> +		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
> +	blk_mq_sched_restart(hctx);
> +	blk_queue_exit(q);
> +}
> +
> +static void blk_mq_put_request(struct request *rq)
> +{
> +	kref_put(&rq->ref, blk_mq_exit_request);
> +}
> +
>  void blk_mq_free_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
>  	struct elevator_queue *e = q->elevator;
>  	struct blk_mq_ctx *ctx = rq->mq_ctx;
>  	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> -	const int sched_tag = rq->internal_tag;
>
>  	if (rq->rq_flags & RQF_ELVPRIV) {
>  		if (e && e->type->ops.mq.finish_request)
> @@ -495,12 +516,7 @@ void blk_mq_free_request(struct request *rq)
>  		blk_put_rl(blk_rq_rl(rq));
>
>  	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
> -	if (rq->tag != -1)
> -		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
> -	if (sched_tag != -1)
> -		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
> -	blk_mq_sched_restart(hctx);
> -	blk_queue_exit(q);
> +	blk_mq_put_request(rq);

Both the line above (an atomic_try_cmpxchg_release is implied by kref_put())
and the kref_init() in blk_mq_rq_ctx_init() run in the fast path and may
introduce some cost, so you may have to run some benchmarks to show whether
there is an effect.

Also, given the cost isn't free, could you describe a bit in the commit
log what we get for that cost?

Thanks,
Ming
Re: [PATCH 1/6] nvme: Sync request queues on reset
On Mon, May 21, 2018 at 10:25:17AM -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 12:08:37AM +0800, Ming Lei wrote:
> > Please take a look at blk_mq_complete_request(). Even with Bart's
> > change, the request still won't be completed by driver. The request can
> > only be completed by either driver or blk-mq, not both.
>
> So you're saying blk-mq can't complete a request the driver returned to
> blk-mq to complete. And that's the nvme driver's problem to fix?

To avoid use-after-free, one request can only be completed by one path,
either by the timeout path or by normal completion (irq or cancel) from
the driver.

So before this request's timeout is handled, the request has to have been
marked as completed by the blk-mq timeout code already, and then
nvme_cancel_request() can't cover this timed-out request.

Thanks,
Ming
Re: [PATCH 3/6] nvme: Move all IO out of controller reset
On Mon, May 21, 2018 at 10:23:55AM -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 12:04:53AM +0800, Ming Lei wrote:
> > On Mon, May 21, 2018 at 09:44:33AM -0600, Keith Busch wrote:
> > > On Mon, May 21, 2018 at 11:34:27PM +0800, Ming Lei wrote:
> > > > nvme_dev_disable() quiesces queues first before killing queues.
> > > >
> > > > If queues are quiesced during or before nvme_wait_freeze() is run
> > > > from the 2nd part of reset, the 2nd part can't move on, and IO hang
> > > > is caused. Finally no reset can be scheduled at all.
> > >
> > > But this patch moves nvme_wait_freeze outside the reset path, so I'm
> > > afraid I'm unable to follow how you've concluded the wait freeze is
> > > somehow part of the reset.
> >
> > For example:
> >
> > 1) the 1st timeout event:
> >
> > - nvme_dev_disable()
> > - reset
> > - scan_work
> >
> > 2) the 2nd timeout event:
> >
> > nvme_dev_disable() may come just after nvme_start_queues() in
> > the above reset of the 1st timeout. And nvme_timeout() won't
> > schedule a new reset since the controller state is NVME_CTRL_CONNECTING.
>
> Let me get this straight -- you're saying nvme_start_queues is going
> to somehow immediately trigger timeout work? I can't see how that could

It may be difficult to trigger in reality, but it won't be impossible,
since the timeout value can be adjusted via a module parameter, and any
scheduling delay can be introduced on a busy system. It isn't good
practice to rely on timing to avoid a race, IMO.

> possibly happen in real life, but we can just remove it and use the existing
> nvme_start_ctrl to handle that in the LIVE state.

OK, please also fix the other issues mentioned in the following comment;
then I will review further and see if it can work well:

https://marc.info/?l=linux-block&m=152668465710591&w=2

Thanks,
Ming
Re: [PATCH 16/34] iomap: add initial support for writes without buffer heads
On Mon, May 21, 2018 at 04:27:00PM -0700, Darrick J. Wong wrote:
> On Fri, May 18, 2018 at 06:48:12PM +0200, Christoph Hellwig wrote:
> > For now just limited to blocksize == PAGE_SIZE, where we can simply read
> > in the full page in write begin, and just set the whole page dirty after
> > copying data into it. This code is enabled by default and XFS will now
> > be fed pages without buffer heads in ->writepage and ->writepages.
> >
> > If a file system sets the IOMAP_F_BUFFER_HEAD flag on the iomap the old
> > path will still be used, this both helps the transition in XFS and
> > prepares for the gfs2 migration to the iomap infrastructure.
> >
> > Signed-off-by: Christoph Hellwig
> > ---
> >  fs/iomap.c            | 132 ++
> >  fs/xfs/xfs_iomap.c    |   6 +-
> >  include/linux/iomap.h |   2 +
> >  3 files changed, 127 insertions(+), 13 deletions(-)
> >
> > diff --git a/fs/iomap.c b/fs/iomap.c
> > index 821671af2618..cd4c563db80a 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -314,6 +314,58 @@ iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
> >  	truncate_pagecache_range(inode, max(pos, i_size), pos + len);
> >  }
> >
> > +static int
> > +iomap_read_page_sync(struct inode *inode, loff_t block_start, struct page *page,
> > +		unsigned poff, unsigned plen, struct iomap *iomap)
> > +{
> > +	struct bio_vec bvec;
> > +	struct bio bio;
> > +	int ret;
> > +
> > +	bio_init(&bio, &bvec, 1);
> > +	bio.bi_opf = REQ_OP_READ;
> > +	bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
> > +	bio_set_dev(&bio, iomap->bdev);
> > +	__bio_add_page(&bio, page, plen, poff);
> > +	ret = submit_bio_wait(&bio);
> > +	if (ret < 0 && iomap_block_needs_zeroing(inode, block_start, iomap))
> > +		zero_user(page, poff, plen);
> > +	return ret;
> > +}
> > +
> > +static int
> > +__iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
> > +		struct page *page, struct iomap *iomap)
> > +{
> > +	loff_t block_size = i_blocksize(inode);
> > +	loff_t block_start = pos & ~(block_size - 1);
> > +	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
> > +	unsigned poff = block_start & (PAGE_SIZE - 1);
> > +	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, block_end - block_start);
> > +	int status;
> > +
> > +	WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE);
> > +
> > +	if (PageUptodate(page))
> > +		return 0;
> > +
> > +	if (iomap_block_needs_zeroing(inode, block_start, iomap)) {
> > +		unsigned from = pos & (PAGE_SIZE - 1), to = from + len;
> > +		unsigned pend = poff + plen;
> > +
> > +		if (poff < from || pend > to)
> > +			zero_user_segments(page, poff, from, to, pend);
> > +	} else {
> > +		status = iomap_read_page_sync(inode, block_start, page,
> > +				poff, plen, iomap);
>
> Something doesn't smell right here. The only pages we need to read in
> are the first and last pages in the write_begin range, and only if they
> aren't page aligned and the underlying extent is IOMAP_MAPPED, right?

And not beyond EOF, too.

The bufferhead code handles this via the buffer_new() flag - it triggers
the skipping of read IO and the states in which it is set are clearly
indicated in iomap_to_bh(). That same logic needs to apply here.

> I also noticed that speculative preallocation kicks in by the second 80M
> write() call and writeback for the second call can successfully allocate
> the entire preallocation, which means that the third (or nth) write call
> can have a real extent already mapped in, and then we end up reading it.

Yeah, that's because there's no check against EOF here.

These writes are all beyond EOF, so there shouldn't be any read at all...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21, 2018 at 02:24:32PM -0400, Mike Snitzer wrote:
> Every single data structure change in this series should be reviewed for
> unforeseen alignment consequences. Jens seemed to say that is
> worthwhile. Not sure if he'll do it or we divide it up. If we divide
> it up a temp topic branch should be published for others to inspect.
>
> Could be alignment hasn't been a historic concern for a bunch of the
> data structures changed in this series.. if so then all we can do is fix
> up any obvious potential for false sharing.

Honestly, I almost never worry about alignment... the very few times I
do care, I use __cacheline_aligned_in_smp. If alignment is a concern in
any of those structs, there really ought to be a comment indicating it.

I very much doubt anything I touched was performance sensitive enough
for it to be an issue, though. And if there is a performance impact, it
should be outweighed by the reduced pointer chasing.

If you disagree, I don't mind leaving the device mapper patch out, it
really makes no difference to me. I could glance over for alignment
issues but I feel like my analysis would not be terribly valuable to you
considering I've already said my position on alignment is "meh, don't
care" :)
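For readers following along, the pointer-chasing point refers to embedding
the bio_set in the owning structure instead of allocating it separately; a
minimal before/after sketch with an invented driver struct, using the
bioset_init()/bioset_exit() API this series introduces:

	struct mydrv {				/* hypothetical */
		struct bio_set	bs;	/* embedded: one less pointer to chase */
		/* __cacheline_aligned_in_smp could go on hot members if
		 * profiling ever shows false sharing, per the above */
	};

	static int mydrv_init(struct mydrv *d)
	{
		/* was: d->bs = bioset_create(BIO_POOL_SIZE, 0,
		 *                            BIOSET_NEED_BVECS); */
		return bioset_init(&d->bs, BIO_POOL_SIZE, 0,
				   BIOSET_NEED_BVECS);
	}

	static void mydrv_exit(struct mydrv *d)
	{
		bioset_exit(&d->bs);	/* was: bioset_free(d->bs); */
	}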
Re: [RFC PATCH 0/3] blk-mq: Timeout rework
On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> The current blk-mq code potentially locks requests out of completion by
> the thousands, making drivers jump through hoops to handle them. This
> patch set allows drivers to complete their requests whenever they're
> completed, without requiring drivers to know anything about the timeout
> code, and with minimal synchronization.
>
> Other proposals under current consideration still have moments that
> prevent a driver from progressing a request to the completed state.
>
> The timeout is ultimately made safe by reference counting the request
> when timeout handling claims the request. By holding the reference count,
> we don't need to do any tricks to prevent a driver from completing the
> request out from under the timeout handler, allowing the actual state
> to be changed inline with the true state, and drivers don't need to be
> aware any of this is happening.
>
> In order to make the overhead as minimal as possible, the request's
> reference is taken only when it appears that actual timeout handling
> needs to be done.

Hello Keith,

Can you explain why the NVMe driver needs reference counting of requests
but no other block driver needs this? Additionally, why is it that for
all block drivers except NVMe the current block layer API is sufficient
(blk_get_request()/blk_execute_rq()/blk_mq_start_request()/
blk_mq_complete_request()/blk_mq_end_request())?

Thanks,

Bart.
Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> @@ -650,27 +600,10 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
>  void blk_mq_complete_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
> -	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> -	int srcu_idx;
>
>  	if (unlikely(blk_should_fake_timeout(q)))
>  		return;
> -
[ ... ]
> +	__blk_mq_complete_request(rq);
>  }
>  EXPORT_SYMBOL(blk_mq_complete_request);
[ ... ]
>  static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  {
>  	const struct blk_mq_ops *ops = req->q->mq_ops;
>  	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
>
> -	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
> -
>  	if (ops->timeout)
>  		ret = ops->timeout(req, reserved);
>
>  	switch (ret) {
>  	case BLK_EH_HANDLED:
> -		__blk_mq_complete_request(req);
> -		break;
> -	case BLK_EH_RESET_TIMER:
>  		/*
> -		 * As nothing prevents from completion happening while
> -		 * ->aborted_gstate is set, this may lead to ignored
> -		 * completions and further spurious timeouts.
> +		 * If the request is still in flight, the driver is requesting
> +		 * blk-mq complete it.
>  		 */
> -		blk_mq_rq_update_aborted_gstate(req, 0);
> +		if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT)
> +			__blk_mq_complete_request(req);
> +		break;
> +	case BLK_EH_RESET_TIMER:
>  		blk_add_timer(req);
>  		break;
>  	case BLK_EH_NOT_HANDLED:
> @@ -880,64 +782,64 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  	}
>  }

I think the above changes can lead to concurrent calls of
__blk_mq_complete_request() from the regular completion path and the
timeout path. That's wrong: the __blk_mq_complete_request() caller
should guarantee that no concurrent calls from another context to that
function can occur.

Bart.
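One way to provide that guarantee — shown here only as an illustration of
the concern, not as part of the posted patch — would be to let the two
paths race on a single atomic state transition, so exactly one of them
performs the completion:

	/*
	 * Illustration only: "rq->state" stands in for whatever field
	 * blk_mq_rq_state() reads in the posted series.
	 */
	if (cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) ==
	    MQ_RQ_IN_FLIGHT)
		__blk_mq_complete_request(rq);

Whichever caller loses the cmpxchg simply returns, so a regular completion
racing with the timeout path can no longer complete the request twice.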
Re: [PATCH 16/34] iomap: add initial support for writes without buffer heads
On Fri, May 18, 2018 at 06:48:12PM +0200, Christoph Hellwig wrote:
> For now just limited to blocksize == PAGE_SIZE, where we can simply read
> in the full page in write begin, and just set the whole page dirty after
> copying data into it. This code is enabled by default and XFS will now
> be fed pages without buffer heads in ->writepage and ->writepages.
>
> If a file system sets the IOMAP_F_BUFFER_HEAD flag on the iomap the old
> path will still be used, this both helps the transition in XFS and
> prepares for the gfs2 migration to the iomap infrastructure.
>
> Signed-off-by: Christoph Hellwig
> ---
>  fs/iomap.c            | 132 ++
>  fs/xfs/xfs_iomap.c    |   6 +-
>  include/linux/iomap.h |   2 +
>  3 files changed, 127 insertions(+), 13 deletions(-)
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 821671af2618..cd4c563db80a 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -314,6 +314,58 @@ iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>  	truncate_pagecache_range(inode, max(pos, i_size), pos + len);
>  }
>
> +static int
> +iomap_read_page_sync(struct inode *inode, loff_t block_start, struct page *page,
> +		unsigned poff, unsigned plen, struct iomap *iomap)
> +{
> +	struct bio_vec bvec;
> +	struct bio bio;
> +	int ret;
> +
> +	bio_init(&bio, &bvec, 1);
> +	bio.bi_opf = REQ_OP_READ;
> +	bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
> +	bio_set_dev(&bio, iomap->bdev);
> +	__bio_add_page(&bio, page, plen, poff);
> +	ret = submit_bio_wait(&bio);
> +	if (ret < 0 && iomap_block_needs_zeroing(inode, block_start, iomap))
> +		zero_user(page, poff, plen);
> +	return ret;
> +}
> +
> +static int
> +__iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
> +		struct page *page, struct iomap *iomap)
> +{
> +	loff_t block_size = i_blocksize(inode);
> +	loff_t block_start = pos & ~(block_size - 1);
> +	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
> +	unsigned poff = block_start & (PAGE_SIZE - 1);
> +	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, block_end - block_start);
> +	int status;
> +
> +	WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE);
> +
> +	if (PageUptodate(page))
> +		return 0;
> +
> +	if (iomap_block_needs_zeroing(inode, block_start, iomap)) {
> +		unsigned from = pos & (PAGE_SIZE - 1), to = from + len;
> +		unsigned pend = poff + plen;
> +
> +		if (poff < from || pend > to)
> +			zero_user_segments(page, poff, from, to, pend);
> +	} else {
> +		status = iomap_read_page_sync(inode, block_start, page,
> +				poff, plen, iomap);

Something doesn't smell right here. The only pages we need to read in
are the first and last pages in the write_begin range, and only if they
aren't page aligned and the underlying extent is IOMAP_MAPPED, right?

I also noticed that speculative preallocation kicks in by the second 80M
write() call and writeback for the second call can successfully allocate
the entire preallocation, which means that the third (or nth) write call
can have a real extent already mapped in, and then we end up reading it.

--D

> +		if (status < 0)
> +			return status;
> +		SetPageUptodate(page);
> +	}
> +
> +	return 0;
> +}
> +
>  static int
>  iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  		struct page **pagep, struct iomap *iomap)
> @@ -331,7 +383,10 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  	if (!page)
>  		return -ENOMEM;
>
> -	status = __block_write_begin_int(page, pos, len, NULL, iomap);
> +	if (iomap->flags & IOMAP_F_BUFFER_HEAD)
> +		status = __block_write_begin_int(page, pos, len, NULL, iomap);
> +	else
> +		status = __iomap_write_begin(inode, pos, len, page, iomap);
>  	if (unlikely(status)) {
>  		unlock_page(page);
>  		put_page(page);
> @@ -344,14 +399,63 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  	return status;
>  }
>
> +int
> +iomap_set_page_dirty(struct page *page)
> +{
> +	struct address_space *mapping = page_mapping(page);
> +	int newly_dirty;
> +
> +	if (unlikely(!mapping))
> +		return !TestSetPageDirty(page);
> +
> +	/*
> +	 * Lock out page->mem_cgroup migration to keep PageDirty
> +	 * synchronized with per-memcg dirty page counters.
> +	 */
> +	lock_page_memcg(page);
> +	newly_dirty = !TestSetPageDirty(page);
> +	if (newly_dirty)
> +		__set_page_dirty(page, mapping, 0);
> +	unlock_page_memcg(page);
> +
> +	if (newly_dirty)
> +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +
[RFC PATCH 0/3] blk-mq: Timeout rework
The current blk-mq code potentially locks requests out of completion by
the thousands, making drivers jump through hoops to handle them. This
patch set allows drivers to complete their requests whenever they're
completed, without requiring drivers to know anything about the timeout
code, and with minimal synchronization.

Other proposals under current consideration still have moments that
prevent a driver from progressing a request to the completed state.

The timeout is ultimately made safe by reference counting the request
when timeout handling claims the request. By holding the reference count,
we don't need to do any tricks to prevent a driver from completing the
request out from under the timeout handler, allowing the actual state
to be changed inline with the true state, and drivers don't need to be
aware any of this is happening.

In order to make the overhead as minimal as possible, the request's
reference is taken only when it appears that actual timeout handling
needs to be done.

Keith Busch (3):
  blk-mq: Reference count request usage
  blk-mq: Fix timeout and state order
  blk-mq: Remove generation sequence

 block/blk-core.c       |   6 -
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c         | 291 +
 block/blk-mq.h         |  20 +---
 block/blk-timeout.c    |   1 -
 include/linux/blkdev.h |  26 +
 6 files changed, 83 insertions(+), 262 deletions(-)

--
2.14.3
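"Taken only when it appears that actual timeout handling needs to be done"
presumably lands in the tag iterator, along these lines — a sketch
reconstructed from the series, with the expiry predicate left as an
assumed helper:

	static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
			struct request *rq, void *priv, bool reserved)
	{
		if (!blk_mq_req_expired(rq, priv))	/* cheap check first */
			return;
		/*
		 * Claim a reference so the tag cannot be reallocated while
		 * the timeout handler looks at the request; a refcount that
		 * already hit zero means the request is being freed and
		 * needs no timeout handling.
		 */
		if (!kref_get_unless_zero(&rq->ref))
			return;
		blk_mq_rq_timed_out(rq, reserved);
		kref_put(&rq->ref, blk_mq_exit_request);
	}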
[RFC PATCH 2/3] blk-mq: Fix timeout and state order
The block layer had been setting the state to in-flight prior to updating
the timer. This is the wrong order since the timeout handler could observe
the in-flight state with the older timeout, believing the request had
expired when in fact it is just getting started.

Signed-off-by: Keith Busch
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8b370ed75605..66e5c768803f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -713,8 +713,8 @@ void blk_mq_start_request(struct request *rq)
 	preempt_disable();
 	write_seqcount_begin(&rq->gstate_seq);
 
-	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
 	blk_add_timer(rq);
+	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
 
 	write_seqcount_end(&rq->gstate_seq);
 	preempt_enable();
--
2.14.3
[RFC PATCH 3/3] blk-mq: Remove generation sequence
This patch simplifies the timeout handling by relying on the request
reference counting to ensure the iterator is operating on an inflight
and truly timed out request. Since the reference counting prevents the
tag from being reallocated, the block layer no longer needs to prevent
drivers from completing their requests while the timeout handler is
operating on it: a driver completing a request is allowed to proceed to
the next state without additional synchronization with the block layer.

This also removes any need for generation sequence numbers since the
request lifetime is prevented from being reallocated as a new sequence
while timeout handling is operating on it.

Signed-off-by: Keith Busch
---
 block/blk-core.c       |   6 --
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c         | 259 ++---
 block/blk-mq.h         |  20 +---
 block/blk-timeout.c    |   1 -
 include/linux/blkdev.h |  26 +
 6 files changed, 58 insertions(+), 255 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 43370faee935..cee03cad99f2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->internal_tag = -1;
 	rq->start_time_ns = ktime_get_ns();
 	rq->part = NULL;
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
-	/*
-	 * See comment of blk_mq_init_request
-	 */
-	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3080e18cb859..ffa622366922 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -344,7 +344,6 @@ static const char *const rqf_name[] = {
 	RQF_NAME(STATS),
 	RQF_NAME(SPECIAL_PAYLOAD),
 	RQF_NAME(ZONE_WRITE_LOCKED),
-	RQF_NAME(MQ_TIMEOUT_EXPIRED),
 	RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 66e5c768803f..4858876fd364 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -589,56 +589,6 @@ static void __blk_mq_complete_request(struct request *rq)
 	put_cpu();
 }
 
-static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
-	__releases(hctx->srcu)
-{
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-		rcu_read_unlock();
-	else
-		srcu_read_unlock(hctx->srcu, srcu_idx);
-}
-
-static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
-	__acquires(hctx->srcu)
-{
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
-		/* shut up gcc false positive */
-		*srcu_idx = 0;
-		rcu_read_lock();
-	} else
-		*srcu_idx = srcu_read_lock(hctx->srcu);
-}
-
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
-{
-	unsigned long flags;
-
-	/*
-	 * blk_mq_rq_aborted_gstate() is used from the completion path and
-	 * can thus be called from irq context. u64_stats_fetch in the
-	 * middle of update on the same CPU leads to lockup. Disable irq
-	 * while updating.
-	 */
-	local_irq_save(flags);
-	u64_stats_update_begin(&rq->aborted_gstate_sync);
-	rq->aborted_gstate = gstate;
-	u64_stats_update_end(&rq->aborted_gstate_sync);
-	local_irq_restore(flags);
-}
-
-static u64 blk_mq_rq_aborted_gstate(struct request *rq)
-{
-	unsigned int start;
-	u64 aborted_gstate;
-
-	do {
-		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
-		aborted_gstate = rq->aborted_gstate;
-	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
-
-	return aborted_gstate;
-}
-
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:	the request being processed
@@ -650,27 +600,10 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
-	int srcu_idx;
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
-
-	/*
-	 * If @rq->aborted_gstate equals the current instance, timeout is
-	 * claiming @rq and we lost. This is synchronized through
-	 * hctx_lock(). See blk_mq_timeout_work() for details.
-	 *
-	 * Completion path never blocks and we can directly use RCU here
-	 * instead of hctx_lock() which can be either RCU or SRCU.
-	 * However, that would complicate paths which want to synchronize
-	 * against us. Let stay in sync with the issue path so that
-	 * hctx_lock() covers both issue and completion paths.
-	 */
-	hctx_lock(hctx, &srcu_idx);
-	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
-		__blk_mq_complete_request(rq);
-	hctx_unlock(hctx, srcu_idx);
+
[RFC PATCH 1/3] blk-mq: Reference count request usage
This patch adds a struct kref to struct request so that request users
can be sure they're operating on the same request without it changing
while they're processing it. The request's tag won't be released for
reuse until the last user is done with it.

Signed-off-by: Keith Busch
---
 block/blk-mq.c         | 30 +++---
 include/linux/blkdev.h |  2 ++
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4cbfd784e837..8b370ed75605 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -332,6 +332,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 #endif
 
 	data->ctx->rq_dispatched[op_is_sync(op)]++;
+	kref_init(&rq->ref);
 	return rq;
 }
 
@@ -465,13 +466,33 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
+static void blk_mq_exit_request(struct kref *ref)
+{
+	struct request *rq = container_of(ref, struct request, ref);
+	struct request_queue *q = rq->q;
+	struct blk_mq_ctx *ctx = rq->mq_ctx;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+	const int sched_tag = rq->internal_tag;
+
+	if (rq->tag != -1)
+		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
+	if (sched_tag != -1)
+		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
+	blk_mq_sched_restart(hctx);
+	blk_queue_exit(q);
+}
+
+static void blk_mq_put_request(struct request *rq)
+{
+	kref_put(&rq->ref, blk_mq_exit_request);
+}
+
 void blk_mq_free_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 	struct elevator_queue *e = q->elevator;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-	const int sched_tag = rq->internal_tag;
 
 	if (rq->rq_flags & RQF_ELVPRIV) {
 		if (e && e->type->ops.mq.finish_request)
@@ -495,12 +516,7 @@ void blk_mq_free_request(struct request *rq)
 		blk_put_rl(blk_rq_rl(rq));
 
 	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
-	if (rq->tag != -1)
-		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
-	if (sched_tag != -1)
-		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
-	blk_mq_sched_restart(hctx);
-	blk_queue_exit(q);
+	blk_mq_put_request(rq);
 }
 EXPORT_SYMBOL_GPL(blk_mq_free_request);

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f3999719f828..26bf2c1e3502 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -257,6 +257,8 @@ struct request {
 	struct u64_stats_sync aborted_gstate_sync;
 	u64 aborted_gstate;
 
+	struct kref ref;
+
 	/* access through blk_rq_set_deadline, blk_rq_deadline */
 	unsigned long __deadline;
--
2.14.3
Re: INFO: task hung in blk_queue_enter
Bart Van Assche wrote:
> On Wed, 2018-05-16 at 17:16 +0200, Dmitry Vyukov wrote:
> > On Wed, May 16, 2018 at 4:56 PM, Bart Van Assche wrote:
> > > On Wed, 2018-05-16 at 22:05 +0900, Tetsuo Handa wrote:
> > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > index 85909b4..59e2496 100644
> > > > --- a/block/blk-core.c
> > > > +++ b/block/blk-core.c
> > > > @@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> > > >  			smp_rmb();
> > > >
> > > >  			wait_event(q->mq_freeze_wq,
> > > > -				   (atomic_read(&q->mq_freeze_depth) == 0 &&
> > > > -				    (preempt || !blk_queue_preempt_only(q))) ||
> > > > +				   atomic_read(&q->mq_freeze_depth) ||
> > > > +				   (preempt || !blk_queue_preempt_only(q)) ||
> > > >  				   blk_queue_dying(q));
> > > > -			if (blk_queue_dying(q))
> > > > +			if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
> > > >  				return -ENODEV;
> > > >  		}
> > > >  	}
> > >
> > > That change looks wrong to me.
> >
> > Hi Bart,
> >
> > Why does it look wrong to you?
>
> Because that change conflicts with the purpose of queue freezing and also
> because that change would inject I/O errors in code paths that shouldn't
> inject I/O errors.

But waiting there until atomic_read(&q->mq_freeze_depth) becomes 0 is
causing a deadlock; a wait_event() that never returns is a bug. I think
we should not wait on q->mq_freeze_depth.
Re: [PATCH v5 4/5] fs: blkdev set bio prio from kiocb prio
On 5/21/18 2:04 PM, Jeff Moyer wrote:
> adam.manzana...@wdc.com writes:
>
>> From: Adam Manzanares
>>
>> Now that kiocb has an ioprio field copy this over to the bio when it is
>> created from the kiocb.
>>
>> Signed-off-by: Adam Manzanares
>> ---
>>  fs/block_dev.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 7ec920e27065..da1e94d2bb75 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -355,6 +355,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>  		bio->bi_write_hint = iocb->ki_hint;
>>  		bio->bi_private = dio;
>>  		bio->bi_end_io = blkdev_bio_end_io;
>> +		bio->bi_ioprio = iocb->ki_ioprio;
>>
>>  		ret = bio_iov_iter_get_pages(bio, iter);
>>  		if (unlikely(ret)) {
>
> Forgot to mention, you should also convert __blkdev_direct_IO_simple.

NP.

>
> -Jeff
>
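For reference, the simple path builds an on-stack bio, so the conversion
Jeff asks for should amount to the same one-liner next to the existing
hint copy — a guess at what the next revision will do, not a posted patch:

	/* in __blkdev_direct_IO_simple(): */
	bio.bi_write_hint = iocb->ki_hint;
	bio.bi_ioprio = iocb->ki_ioprio;	/* proposed addition */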
Re: [PATCH v5 0/5] AIO add per-command iopriority
On 5/21/18 1:57 PM, Jeff Moyer wrote:
> Hi, Adam,
>
> adam.manzana...@wdc.com writes:
>
>> From: Adam Manzanares
>>
>> This is the per-I/O equivalent of the ioprio_set system call.
>> See the following link for performance implications on a SATA HDD:
>> https://lkml.org/lkml/2016/12/6/495
>>
>> First patch factors ioprio_check_cap function out of ioprio_set system
>> call to also be used by the aio ioprio interface.
>>
>> Second patch converts kiocb ki_hint field to a u16 to avoid kiocb bloat.
>>
>> Third patch passes ioprio hint from aio iocb to kiocb and initializes
>> kiocb ioprio value appropriately when it is not explicitly set.
>>
>> Fourth patch enables the feature for blkdev.
>>
>> Fifth patch enables the feature for iomap direct IO.
>>
>> Note: this work is based on top of linux-vfs/for-next
>
> I'll cook up a libaio test case. Can you put together a man-pages
> update for this?

Hello Jeff,

I'll send out a man-pages update soon.

Thanks,
Adam

>
> Thanks!
> Jeff
>
Re: [PATCH v5 4/5] fs: blkdev set bio prio from kiocb prio
adam.manzana...@wdc.com writes:

> From: Adam Manzanares
>
> Now that kiocb has an ioprio field copy this over to the bio when it is
> created from the kiocb.
>
> Signed-off-by: Adam Manzanares
> ---
>  fs/block_dev.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 7ec920e27065..da1e94d2bb75 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -355,6 +355,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  		bio->bi_write_hint = iocb->ki_hint;
>  		bio->bi_private = dio;
>  		bio->bi_end_io = blkdev_bio_end_io;
> +		bio->bi_ioprio = iocb->ki_ioprio;
>
>  		ret = bio_iov_iter_get_pages(bio, iter);
>  		if (unlikely(ret)) {

Forgot to mention, you should also convert __blkdev_direct_IO_simple.

-Jeff
Re: [PATCH v5 1/5] block: add ioprio_check_cap function
adam.manzana...@wdc.com writes:

> From: Adam Manzanares
>
> Aio per command iopriority support introduces a second interface between
> userland and the kernel capable of passing iopriority. The aio interface
> also needs the ability to verify that the submitting context has
> sufficient privileges to submit IOPRIO_RT commands. This patch creates
> the ioprio_check_cap function to be used by the ioprio_set system call
> and also by the aio interface.
>
> Signed-off-by: Adam Manzanares
> Reviewed-by: Christoph Hellwig

Reviewed-by: Jeff Moyer

> ---
>  block/ioprio.c         | 22 --
>  include/linux/ioprio.h |  2 ++
>  2 files changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/block/ioprio.c b/block/ioprio.c
> index 6f5d0b6625e3..f9821080c92c 100644
> --- a/block/ioprio.c
> +++ b/block/ioprio.c
> @@ -61,15 +61,10 @@ int set_task_ioprio(struct task_struct *task, int ioprio)
>  }
>  EXPORT_SYMBOL_GPL(set_task_ioprio);
>
> -SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
> +int ioprio_check_cap(int ioprio)
>  {
>  	int class = IOPRIO_PRIO_CLASS(ioprio);
>  	int data = IOPRIO_PRIO_DATA(ioprio);
> -	struct task_struct *p, *g;
> -	struct user_struct *user;
> -	struct pid *pgrp;
> -	kuid_t uid;
> -	int ret;
>
>  	switch (class) {
>  	case IOPRIO_CLASS_RT:
> @@ -92,6 +87,21 @@ SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
>  		return -EINVAL;
>  	}
>
> +	return 0;
> +}
> +
> +SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
> +{
> +	struct task_struct *p, *g;
> +	struct user_struct *user;
> +	struct pid *pgrp;
> +	kuid_t uid;
> +	int ret;
> +
> +	ret = ioprio_check_cap(ioprio);
> +	if (ret)
> +		return ret;
> +
>  	ret = -ESRCH;
>  	rcu_read_lock();
>  	switch (which) {
> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
> index 627efac73e6d..4a28cec49ec3 100644
> --- a/include/linux/ioprio.h
> +++ b/include/linux/ioprio.h
> @@ -77,4 +77,6 @@ extern int ioprio_best(unsigned short aprio, unsigned short bprio);
>
>  extern int set_task_ioprio(struct task_struct *task, int ioprio);
>
> +extern int ioprio_check_cap(int ioprio);
> +
>  #endif
Re: [PATCH v5 5/5] fs: iomap dio set bio prio from kiocb prio
adam.manzana...@wdc.com writes:

> From: Adam Manzanares
>
> Now that kiocb has an ioprio field copy this over to the bio when it is
> created from the kiocb during direct IO.
>
> Signed-off-by: Adam Manzanares

Reviewed-by: Jeff Moyer

> ---
>  fs/iomap.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index afd163586aa0..65aae194aeca 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -919,6 +919,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		bio->bi_iter.bi_sector =
>  			(iomap->addr + pos - iomap->offset) >> 9;
>  		bio->bi_write_hint = dio->iocb->ki_hint;
> +		bio->bi_ioprio = dio->iocb->ki_ioprio;
>  		bio->bi_private = dio;
>  		bio->bi_end_io = iomap_dio_bio_end_io;
Re: [PATCH v5 4/5] fs: blkdev set bio prio from kiocb prio
adam.manzana...@wdc.com writes:

> From: Adam Manzanares
>
> Now that kiocb has an ioprio field copy this over to the bio when it is
> created from the kiocb.
>
> Signed-off-by: Adam Manzanares

Reviewed-by: Jeff Moyer

> ---
>  fs/block_dev.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 7ec920e27065..da1e94d2bb75 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -355,6 +355,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  		bio->bi_write_hint = iocb->ki_hint;
>  		bio->bi_private = dio;
>  		bio->bi_end_io = blkdev_bio_end_io;
> +		bio->bi_ioprio = iocb->ki_ioprio;
>
>  		ret = bio_iov_iter_get_pages(bio, iter);
>  		if (unlikely(ret)) {
Re: [PATCH v5 3/5] fs: Add aio iopriority support
adam.manzana...@wdc.com writes:

> From: Adam Manzanares
>
> This is the per-I/O equivalent of the ioprio_set system call.
>
> When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the
> newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field.
>
> We set the blkdev bio iopriority unconditionally, so we need to guarantee
> the kiocb is initialized properly. Added changes to the loopback driver
> and init_sync_kiocb to achieve this.
>
> This patch depends on block: add ioprio_check_cap function.
>
> Signed-off-by: Adam Manzanares

Reviewed-by: Jeff Moyer

> ---
>  drivers/block/loop.c         |  3 +++
>  fs/aio.c                     | 16
>  include/linux/fs.h           |  3 +++
>  include/uapi/linux/aio_abi.h |  1 +
>  4 files changed, 23 insertions(+)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 5d4e31655d96..dd98dfd97f5e 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -76,6 +76,8 @@
>  #include
>  #include
>  #include
> +#include <linux/ioprio.h>
> +
>  #include "loop.h"
>
>  #include
> @@ -559,6 +561,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
>  	cmd->iocb.ki_filp = file;
>  	cmd->iocb.ki_complete = lo_rw_aio_complete;
>  	cmd->iocb.ki_flags = IOCB_DIRECT;
> +	cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
>  	if (cmd->css)
>  		kthread_associate_blkcg(cmd->css);
>
> diff --git a/fs/aio.c b/fs/aio.c
> index f3eae5d5771b..44b4572be524 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1451,6 +1451,22 @@ static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
>  	if (iocb->aio_flags & IOCB_FLAG_RESFD)
>  		req->ki_flags |= IOCB_EVENTFD;
>  	req->ki_hint = file_write_hint(req->ki_filp);
> +	if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
> +		/*
> +		 * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then
> +		 * aio_reqprio is interpreted as an I/O scheduling
> +		 * class and priority.
> +		 */
> +		ret = ioprio_check_cap(iocb->aio_reqprio);
> +		if (ret) {
> +			pr_debug("aio ioprio check cap error\n");
> +			return -EINVAL;
> +		}
> +
> +		req->ki_ioprio = iocb->aio_reqprio;
> +	} else
> +		req->ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
> +
>  	ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags);
>  	if (unlikely(ret))
>  		fput(req->ki_filp);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 50de40dbbb85..73b749ed3ea1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -36,6 +36,7 @@
>  #include
>  #include
>  #include
> +#include <linux/ioprio.h>
>
>  #include
>  #include
> @@ -302,6 +303,7 @@ struct kiocb {
>  	void			*private;
>  	int			ki_flags;
>  	u16			ki_hint;
> +	u16			ki_ioprio; /* See linux/ioprio.h */
>  } __randomize_layout;
>
>  static inline bool is_sync_kiocb(struct kiocb *kiocb)
> @@ -1942,6 +1944,7 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
>  		.ki_filp = filp,
>  		.ki_flags = iocb_flags(filp),
>  		.ki_hint = ki_hint_validate(file_write_hint(filp)),
> +		.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0),
>  	};
>  }
>
> diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
> index 2c0a3415beee..d4e768d55d14 100644
> --- a/include/uapi/linux/aio_abi.h
> +++ b/include/uapi/linux/aio_abi.h
> @@ -55,6 +55,7 @@ enum {
>   * is valid.
>   */
>  #define IOCB_FLAG_RESFD		(1 << 0)
> +#define IOCB_FLAG_IOPRIO	(1 << 1)
>
>  /* read() from /dev/aio returns these structures. */
>  struct io_event {
Re: [PATCH v5 0/5] AIO add per-command iopriority
Hi, Adam,

adam.manzana...@wdc.com writes:

> From: Adam Manzanares
>
> This is the per-I/O equivalent of the ioprio_set system call.
> See the following link for performance implications on a SATA HDD:
> https://lkml.org/lkml/2016/12/6/495
>
> First patch factors ioprio_check_cap function out of ioprio_set system
> call to also be used by the aio ioprio interface.
>
> Second patch converts kiocb ki_hint field to a u16 to avoid kiocb bloat.
>
> Third patch passes ioprio hint from aio iocb to kiocb and initializes
> kiocb ioprio value appropriately when it is not explicitly set.
>
> Fourth patch enables the feature for blkdev.
>
> Fifth patch enables the feature for iomap direct IO.
>
> Note: this work is based on top of linux-vfs/for-next

I'll cook up a libaio test case. Can you put together a man-pages
update for this?

Thanks!
Jeff
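A test along those lines could also drive the raw ABI directly instead of
libaio; a hypothetical sketch follows (error handling omitted, device path
and RT class chosen arbitrarily; the IOPRIO_* values are restated locally
because no uapi header exports them, and IOCB_FLAG_IOPRIO is defined as a
fallback for headers predating this series):

	#define _GNU_SOURCE
	#include <linux/aio_abi.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <fcntl.h>
	#include <string.h>

	#ifndef IOCB_FLAG_IOPRIO
	#define IOCB_FLAG_IOPRIO	(1 << 1)	/* from this series */
	#endif
	#define IOPRIO_CLASS_SHIFT	13
	#define IOPRIO_PRIO_VALUE(cl, d) (((cl) << IOPRIO_CLASS_SHIFT) | (d))
	#define IOPRIO_CLASS_RT		1

	int main(void)
	{
		aio_context_t ctx = 0;
		struct iocb cb, *cbs[1] = { &cb };
		struct io_event ev;
		static char buf[4096] __attribute__((aligned(4096)));
		int fd = open("/dev/sda", O_RDONLY | O_DIRECT);

		syscall(__NR_io_setup, 1, &ctx);

		memset(&cb, 0, sizeof(cb));
		cb.aio_lio_opcode = IOCB_CMD_PREAD;
		cb.aio_fildes = fd;
		cb.aio_buf = (unsigned long)buf;
		cb.aio_nbytes = sizeof(buf);
		/* the new bits: per-command priority; the RT class
		 * requires CAP_SYS_ADMIN, per ioprio_check_cap() */
		cb.aio_flags = IOCB_FLAG_IOPRIO;
		cb.aio_reqprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 4);

		syscall(__NR_io_submit, ctx, 1, cbs);
		syscall(__NR_io_getevents, ctx, 1, 1, &ev, NULL);
		syscall(__NR_io_destroy, ctx);
		return ev.res == sizeof(buf) ? 0 : 1;
	}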
Re: buffered I/O without buffer heads in xfs and iomap v2
On Fri, May 18, 2018 at 06:47:56PM +0200, Christoph Hellwig wrote:
> Hi all,
>
> this series adds support for buffered I/O without buffer heads to
> the iomap and XFS code.
>
> For now this series only contains support for block size == PAGE_SIZE,
> with the 4k support split into a separate series.
>
>
> A git tree is available at:
>
>     git://git.infradead.org/users/hch/xfs.git xfs-iomap-read.2
>
> Gitweb:
>
>     http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-iomap-read.2

Hmm, so I pulled this and ran my trivial stupid benchmark on for-next.
It's a stupid VM with 2G of RAM and a 12GB virtio-scsi disk backed by
tmpfs:

# mkfs.xfs -f -m rmapbt=0,reflink=1 /dev/sda
meta-data=/dev/sda               isize=512    agcount=4, agsize=823296 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1
data     =                       bsize=4096   blocks=3293184, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=3693, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
# mount /dev/sda /mnt
# xfs_io -f -c 'pwrite -W -S 0x64 -b 83886080 0 734003200' /mnt/a
wrote 734003200/734003200 bytes at offset 0
700 MiB, 9 ops; 0:00:01.06 (655.500 MiB/sec and 8.4279 ops/sec)
# cp --reflink=always /mnt/a /mnt/b
# xfs_io -f -c 'pwrite -W -S 0x65 -b 83886080 0 734003200' /mnt/b
wrote 734003200/734003200 bytes at offset 0
700 MiB, 9 ops; 0.9620 sec (727.615 MiB/sec and 9.3551 ops/sec)

Then I applied your series (not including the blocksize < pagesize
series) and saw this big regression:

# mkfs.xfs -f -m rmapbt=0,reflink=1 /dev/sda
meta-data=/dev/sda               isize=512    agcount=4, agsize=823296 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1
data     =                       bsize=4096   blocks=3293184, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=3693, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
# mount /dev/sda /mnt
# xfs_io -f -c 'pwrite -W -S 0x64 -b 83886080 0 734003200' /mnt/a
wrote 734003200/734003200 bytes at offset 0
700 MiB, 9 ops; 0:00:08.04 (87.031 MiB/sec and 1.1190 ops/sec)
# cp --reflink=always /mnt/a /mnt/b
# xfs_io -f -c 'pwrite -W -S 0x65 -b 83886080 0 734003200' /mnt/b
wrote 734003200/734003200 bytes at offset 0
700 MiB, 9 ops; 0:00:21.61 (32.389 MiB/sec and 0.4164 ops/sec)

I'll see if I can spot the problem while I read through the v2 code...

--D

> Changes since v1:
>  - fix the iomap_readpages error handling
>  - use unsigned file offsets in a few places to avoid arithmetic overflows
>  - allocate a iomap_page in iomap_page_mkwrite to fix generic/095
>  - improve a few comments
>  - add more asserts
>  - warn about truncated block numbers from ->bmap
>  - new patch to change the __do_page_cache_readahead return value to
>    unsigned int
>  - remove an incorrectly added empty line
>  - make inline data an explicit iomap type instead of a flag
>  - add a IOMAP_F_BUFFER_HEAD flag to force use of buffer heads for gfs2,
>    and keep the basic buffer head infrastructure around for now.
Re: [PATCH 05/34] fs: use ->is_partially_uptodate in page_cache_seek_hole_data
On Fri, May 18, 2018 at 06:48:01PM +0200, Christoph Hellwig wrote: > This way the implementation doesn't depend on buffer_head internals. > > Signed-off-by: Christoph Hellwig> --- > fs/iomap.c | 83 +++--- > 1 file changed, 42 insertions(+), 41 deletions(-) > > diff --git a/fs/iomap.c b/fs/iomap.c > index bef5e91d40bf..0fecd5789d7b 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -594,31 +594,54 @@ EXPORT_SYMBOL_GPL(iomap_fiemap); > * > * Returns the offset within the file on success, and -ENOENT otherwise. This comment is now wrong, since we return the offset via *lastoff and I think the return value is whether or not we found what we were looking for...? > */ > -static loff_t > -page_seek_hole_data(struct page *page, loff_t lastoff, int whence) > +static bool > +page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff, > + int whence) > { > - loff_t offset = page_offset(page); > - struct buffer_head *bh, *head; > + const struct address_space_operations *ops = inode->i_mapping->a_ops; > + unsigned int bsize = i_blocksize(inode), off; > bool seek_data = whence == SEEK_DATA; > + loff_t poff = page_offset(page); > > - if (lastoff < offset) > - lastoff = offset; > - > - bh = head = page_buffers(page); > - do { > - offset += bh->b_size; > - if (lastoff >= offset) > - continue; > + if (WARN_ON_ONCE(*lastoff >= poff + PAGE_SIZE)) > + return false; > > + if (*lastoff < poff) { > /* > - * Any buffer with valid data in it should have BH_Uptodate set. > + * Last offset smaller than the start of the page means we found > + * a hole: >*/ > - if (buffer_uptodate(bh) == seek_data) > - return lastoff; > + if (whence == SEEK_HOLE) > + return true; > + *lastoff = poff; > + } > > - lastoff = offset; > - } while ((bh = bh->b_this_page) != head); > - return -ENOENT; > + /* > + * Just check the page unless we can and should check block ranges: > + */ > + if (bsize == PAGE_SIZE || !ops->is_partially_uptodate) { > + if (PageUptodate(page) == seek_data) > + return true; > + return false; return PageUptodate(page) == seek_data; ? --D > + } > + > + lock_page(page); > + if (unlikely(page->mapping != inode->i_mapping)) > + goto out_unlock_not_found; > + > + for (off = 0; off < PAGE_SIZE; off += bsize) { > + if ((*lastoff & ~PAGE_MASK) >= off + bsize) > + continue; > + if (ops->is_partially_uptodate(page, off, bsize) == seek_data) { > + unlock_page(page); > + return true; > + } > + *lastoff = poff + off + bsize; > + } > + > +out_unlock_not_found: > + unlock_page(page); > + return false; > } > > /* > @@ -655,30 +678,8 @@ page_cache_seek_hole_data(struct inode *inode, loff_t > offset, loff_t length, > for (i = 0; i < nr_pages; i++) { > struct page *page = pvec.pages[i]; > > - /* > - * At this point, the page may be truncated or > - * invalidated (changing page->mapping to NULL), or > - * even swizzled back from swapper_space to tmpfs file > - * mapping. However, page->index will not change > - * because we have a reference on the page. > - * > - * If current page offset is beyond where we've ended, > - * we've found a hole. 
> - */ > - if (whence == SEEK_HOLE && > - lastoff < page_offset(page)) > + if (page_seek_hole_data(inode, page, &lastoff, whence)) > goto check_range; > - > - lock_page(page); > - if (likely(page->mapping == inode->i_mapping) && > - page_has_buffers(page)) { > - lastoff = page_seek_hole_data(page, lastoff, > whence); > - if (lastoff >= 0) { > - unlock_page(page); > - goto check_range; > - } > - } > - unlock_page(page); > lastoff = page_offset(page) + PAGE_SIZE; > } > pagevec_release(&pvec); > -- > 2.17.0
Re: [PATCH 00/10] Misc block layer patches for bcachefs
On Mon, 2018-05-21 at 11:37 -0700, Omar Sandoval wrote: > Have you made any progress in porting srp-test to blktests so we don't > have to have this conversation again? Hello Omar, Porting the srp-test software to the blktests framework is still high on my to-do list. I will start working on this as soon as the work on adding SMR support for fio has been completed. Bart.
Re: [PATCH 11/12] xfs: convert to bioset_init()/mempool_init()
On Sun, May 20, 2018 at 06:25:57PM -0400, Kent Overstreet wrote: > Signed-off-by: Kent Overstreet Looks ok, I guess... Acked-by: Darrick J. Wong --D > --- > fs/xfs/xfs_aops.c | 2 +- > fs/xfs/xfs_aops.h | 2 +- > fs/xfs/xfs_super.c | 11 +-- > 3 files changed, 7 insertions(+), 8 deletions(-) > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index 0ab824f574..102463543d 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -594,7 +594,7 @@ xfs_alloc_ioend( > struct xfs_ioend *ioend; > struct bio *bio; > > - bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, xfs_ioend_bioset); > + bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, &xfs_ioend_bioset); > xfs_init_bio_from_bh(bio, bh); > > ioend = container_of(bio, struct xfs_ioend, io_inline_bio); > diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h > index 69346d460d..694c85b038 100644 > --- a/fs/xfs/xfs_aops.h > +++ b/fs/xfs/xfs_aops.h > @@ -18,7 +18,7 @@ > #ifndef __XFS_AOPS_H__ > #define __XFS_AOPS_H__ > > -extern struct bio_set *xfs_ioend_bioset; > +extern struct bio_set xfs_ioend_bioset; > > /* > * Types of I/O for bmap clustering and I/O completion tracking. > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index d714240529..f643d76db5 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -63,7 +63,7 @@ > #include > > static const struct super_operations xfs_super_operations; > -struct bio_set *xfs_ioend_bioset; > +struct bio_set xfs_ioend_bioset; > > static struct kset *xfs_kset; /* top-level xfs sysfs dir */ > #ifdef DEBUG > @@ -1845,10 +1845,9 @@ MODULE_ALIAS_FS("xfs"); > STATIC int __init > xfs_init_zones(void) > { > - xfs_ioend_bioset = bioset_create(4 * MAX_BUF_PER_PAGE, > + if (bioset_init(&xfs_ioend_bioset, 4 * MAX_BUF_PER_PAGE, > offsetof(struct xfs_ioend, io_inline_bio), > - BIOSET_NEED_BVECS); > - if (!xfs_ioend_bioset) > + BIOSET_NEED_BVECS)) > goto out; > > xfs_log_ticket_zone = kmem_zone_init(sizeof(xlog_ticket_t), > @@ -1997,7 +1996,7 @@ xfs_init_zones(void) > out_destroy_log_ticket_zone: > kmem_zone_destroy(xfs_log_ticket_zone); > out_free_ioend_bioset: > - bioset_free(xfs_ioend_bioset); > + bioset_exit(&xfs_ioend_bioset); > out: > return -ENOMEM; > } > @@ -2029,7 +2028,7 @@ xfs_destroy_zones(void) > kmem_zone_destroy(xfs_btree_cur_zone); > kmem_zone_destroy(xfs_bmap_free_item_zone); > kmem_zone_destroy(xfs_log_ticket_zone); > - bioset_free(xfs_ioend_bioset); > + bioset_exit(&xfs_ioend_bioset); > } > > STATIC int __init > -- > 2.17.0
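The conversion idiom is the same at every call site in the series: the bio_set moves from behind a pointer into the owning structure, and setup/teardown switch to the int-returning bioset_init()/bioset_exit() pair. A minimal sketch, assuming a hypothetical driver struct (my_dev and its functions are illustrative, not from the patch):

#include <linux/bio.h>

/* Hypothetical owner: the bio_set is embedded, so allocation fast paths
 * reach it by constant offset instead of chasing a pointer. */
struct my_dev {
	struct bio_set bs;	/* was: struct bio_set *bs; */
};

static int my_dev_init(struct my_dev *dev)
{
	/* bioset_init() reports failure by return value, not NULL. */
	return bioset_init(&dev->bs, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
}

static struct bio *my_dev_alloc_bio(struct my_dev *dev)
{
	return bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, &dev->bs);
}

static void my_dev_exit(struct my_dev *dev)
{
	/* bioset_exit() replaces bioset_free(). */
	bioset_exit(&dev->bs);
}

Running pahole on the owning struct after such a conversion is the quick way to confirm that embedding the set did not push a hot member across a cacheline boundary.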
Re: [PATCH 00/10] Misc block layer patches for bcachefs
On Mon, May 21, 2018 at 03:11:08PM +, Bart Van Assche wrote: > On Sun, 2018-05-20 at 19:58 -0400, Kent Overstreet wrote: > > On Sun, May 20, 2018 at 11:40:45PM +, Bart Van Assche wrote: > > > On Sun, 2018-05-20 at 19:21 -0400, Kent Overstreet wrote: > > > > I really have better things to do than debug someone else's tests... > > > > [ ... ] > > > > ../run_tests: line 65: cd: /lib/modules/4.16.0+/kernel/block: No such > > > > file or directory > > > > > > Kernel v4.16 is too old to run these tests. The srp-test script needs the > > > following commit that went upstream in kernel v4.17-rc1: > > > > > > 63cf1a902c9d ("IB/srpt: Add RDMA/CM support") > > > > Same output on Jens' for-next branch. > > Others have been able to run the srp-test software with the instructions > provided earlier in this e-mail thread. Can you share the kernel messages from > around the time the test was run (dmesg, /var/log/messages or > /var/log/syslog)? > > Thanks, > > Bart. Bart, Have you made any progress in porting srp-test to blktests so we don't have to have this conversation again?
Re: [PATCH RESEND] loop: clear wb_err in bd_inode when detaching backing file
On 5/21/18 12:35 PM, Jeff Layton wrote: > From: Jeff Layton > > When a loop block device encounters a writeback error, that error will > get propagated to the bd_inode's wb_err field. If we then detach the > backing file from it, attach another and fsync it, we'll get back the > writeback error that we had from the previous backing file. > > This is a bit of a grey area as POSIX doesn't cover loop devices, but it > is somewhat counterintuitive. > > If we detach a backing file from the loopdev while there are still > unreported errors, take it as a sign that we're no longer interested in > the previous file, and clear out the wb_err in the loop blockdev. Applied, thanks Jeff. -- Jens Axboe
[PATCH RESEND] loop: clear wb_err in bd_inode when detaching backing file
From: Jeff Layton  When a loop block device encounters a writeback error, that error will get propagated to the bd_inode's wb_err field. If we then detach the backing file from it, attach another and fsync it, we'll get back the writeback error that we had from the previous backing file. This is a bit of a grey area as POSIX doesn't cover loop devices, but it is somewhat counterintuitive. If we detach a backing file from the loopdev while there are still unreported errors, take it as a sign that we're no longer interested in the previous file, and clear out the wb_err in the loop blockdev. Reported-and-Tested-by: Theodore Y. Ts'o Signed-off-by: Jeff Layton --- drivers/block/loop.c | 1 + 1 file changed, 1 insertion(+) Resend to fix the attributions. diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 5d4e31655d96..55cf554bc914 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -1068,6 +1068,7 @@ static int loop_clr_fd(struct loop_device *lo) if (bdev) { bdput(bdev); invalidate_bdev(bdev); + bdev->bd_inode->i_mapping->wb_err = 0; } set_capacity(lo->lo_disk, 0); loop_sysfs_exit(lo); -- 2.17.0
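The fix leans on the errseq_t semantics underneath wb_err: a recorded error stays visible to later checkers until it has been reported. A rough sketch of those semantics (illustrative only; mapping_err stands in for bdev->bd_inode->i_mapping->wb_err):

#include <linux/errno.h>
#include <linux/errseq.h>

static int demo_stale_error(void)
{
	errseq_t mapping_err = 0;
	errseq_t since = errseq_sample(&mapping_err);

	/* The old backing file hits a writeback error before detach: */
	errseq_set(&mapping_err, -EIO);

	/*
	 * After detaching and attaching a new backing file, an fsync()
	 * that checks the mapping still sees the old, never-reported
	 * error -- which is why loop_clr_fd() now resets wb_err to 0.
	 */
	return errseq_check(&mapping_err, since);	/* returns -EIO */
}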
Re: [PATCH] loop: clear wb_err in bd_inode when detaching backing file
On 5/21/18 12:26 PM, Jeff Layton wrote: > On Mon, 2018-05-21 at 08:37 -0400, Jeff Layton wrote: >> From: Jeff Layton >> >> When a loop block device encounters a writeback error, that error will >> get propagated to the bd_inode's wb_err field. If we then detach the >> backing file from it, attach another and fsync it, we'll get back the >> writeback error that we had from the previous backing file. >> >> This is a bit of a grey area as POSIX doesn't cover loop devices, but it >> is somewhat counterintuitive. >> >> If we detach a backing file from the loopdev while there are still >> unreported errors, take it as a sign that we're no longer interested in >> the previous file, and clear out the wb_err in the loop blockdev. >> >> Signed-off-by: Jeff Layton >> --- >> drivers/block/loop.c | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/drivers/block/loop.c b/drivers/block/loop.c >> index 5d4e31655d96..55cf554bc914 100644 >> --- a/drivers/block/loop.c >> +++ b/drivers/block/loop.c >> @@ -1068,6 +1068,7 @@ static int loop_clr_fd(struct loop_device *lo) >> if (bdev) { >> bdput(bdev); >> invalidate_bdev(bdev); >> + bdev->bd_inode->i_mapping->wb_err = 0; >> } >> set_capacity(lo->lo_disk, 0); >> loop_sysfs_exit(lo); > > Jens, > > I sent this to the mailing lists earlier, but it occurs to me that > loop.c changes should probably go through the block tree. Would you > mind picking this up and feeding it to next and on to Linus? > > I think we probably want this in v4.17 if possible, as it is a > regression from v4.16 (due to commit b4678df184b3) and was causing some > failures in xfstests runs. > > Hmm, now that I think about it, we should probably also give Ted > "Reported-and-Tested-by" credit here too. Let me know if you'd rather I > resend with that added. Yeah, I can funnel it for 4.17 - a quick resend with the right attributions would be appreciated, thanks. -- Jens Axboe
Re: [PATCH blktests] Documentation: document prerequisite scriptlets
On Mon, May 14, 2018 at 01:40:42PM +0200, Johannes Thumshirn wrote: > The config file is bash and it gets sourced, so all bash magic is > doable in there as well. Document it so others don't have to > re-discover this gem as well. I'm supportive of this... > Signed-off-by: Johannes Thumshirn > --- > Documentation/running-tests.md | 12 > 1 file changed, 12 insertions(+) > > diff --git a/Documentation/running-tests.md b/Documentation/running-tests.md > index a479d5e94c5e..b477c0679683 100644 > --- a/Documentation/running-tests.md > +++ b/Documentation/running-tests.md > @@ -66,3 +66,15 @@ command line option. > QUICK_RUN=1 > TIMEOUT=30 > ``` > + > +### Pre-test setups > + > +Some tests may need special prerequisites, like configfs being > +mounted for NVMe over Fabrics tests. You can add your custom bash > +scriptlets to `config` to get this done, e.g.: > + > +```sh > +if ! test $(grep -q configfs /proc/mounts) ; then > +mount -t configfs none /sys/kernel/config > +fi > +``` But I'm curious about this specific example. Is this not mounted for you automatically? I'm guessing systemd does it for me on my setup.
Re: [PATCH] loop: clear wb_err in bd_inode when detaching backing file
On Mon, 2018-05-21 at 08:37 -0400, Jeff Layton wrote: > From: Jeff Layton > > When a loop block device encounters a writeback error, that error will > get propagated to the bd_inode's wb_err field. If we then detach the > backing file from it, attach another and fsync it, we'll get back the > writeback error that we had from the previous backing file. > > This is a bit of a grey area as POSIX doesn't cover loop devices, but it > is somewhat counterintuitive. > > If we detach a backing file from the loopdev while there are still > unreported errors, take it as a sign that we're no longer interested in > the previous file, and clear out the wb_err in the loop blockdev. > > Signed-off-by: Jeff Layton > --- > drivers/block/loop.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > index 5d4e31655d96..55cf554bc914 100644 > --- a/drivers/block/loop.c > +++ b/drivers/block/loop.c > @@ -1068,6 +1068,7 @@ static int loop_clr_fd(struct loop_device *lo) > if (bdev) { > bdput(bdev); > invalidate_bdev(bdev); > + bdev->bd_inode->i_mapping->wb_err = 0; > } > set_capacity(lo->lo_disk, 0); > loop_sysfs_exit(lo); Jens, I sent this to the mailing lists earlier, but it occurs to me that loop.c changes should probably go through the block tree. Would you mind picking this up and feeding it to next and on to Linus? I think we probably want this in v4.17 if possible, as it is a regression from v4.16 (due to commit b4678df184b3) and was causing some failures in xfstests runs. Hmm, now that I think about it, we should probably also give Ted "Reported-and-Tested-by" credit here too. Let me know if you'd rather I resend with that added. Thanks, -- Jeff Layton
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21 2018 at 1:37pm -0400, Kent Overstreet wrote: > > Uh, you came across as upset and paranoid to me too. Chalk it up to email? :) Awesome. See how easy it is to make someone with purely constructive questions and feedback come off as upset and paranoid? The tipping point occurs when bait is set with: "It's not like ". Then: "Let's focus on getting it reviewed, rather than pontificate on what could potentially go all wrong with this." Message received: less pontificating, more doing! And here I thought that focusing on what could go wrong (across all code touched) was review. But OK, I'm the one that made this all weird ;) It is what it is at this point. > I personally don't care, I have no horse in this race. This particular patch > series wasn't my idea, Christoph wanted all these conversions done so > bioset_create() could be deleted. If you want us to hold off on the dm patch for > awhile until someone can get around to testing it or whatever (since I don't > have tests for everything I pushed) that would be perfectly fine by me. As I clarified already: this isn't about DM. Every single data structure change in this series should be reviewed for unforeseen alignment consequences. Jens seemed to say that is worthwhile. Not sure if he'll do it or we divide it up. If we divide it up a temp topic branch should be published for others to inspect. Could be alignment hasn't been a historic concern for a bunch of the data structures changed in this series.. if so then all we can do is fix up any obvious potential for false sharing. Mike
[PATCH] block: Verify whether blk_queue_enter() is used when necessary
It is required to protect blkg_lookup() calls with a blk_queue_enter() / blk_queue_exit() pair. Since it is nontrivial to verify whether this is the case, verify this at runtime. Only perform this verification if CONFIG_LOCKDEP=y to avoid that unnecessary runtime overhead is added. Note: using lockdep to verify whether blkg_lookup() is protected correctly is not possible since lock_acquire() and lock_release() must be called from the same task and since blk_queue_enter() and blk_queue_exit() can be called from different tasks. Suggested-by: Tejun HeoSigned-off-by: Bart Van Assche Cc: Tejun Heo --- block/blk-cgroup.c | 2 ++ block/blk-core.c| 24 include/linux/blk-cgroup.h | 2 ++ include/linux/blkdev.h | 11 +++ include/linux/percpu-refcount.h | 2 ++ lib/percpu-refcount.c | 25 + 6 files changed, 66 insertions(+) diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index eb85cb87c40f..78822dcfa0da 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -145,6 +145,8 @@ struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg, { struct blkcg_gq *blkg; + WARN_ON_ONCE(!blk_entered_queue(q)); + /* * Hint didn't match. Look up from the radix tree. Note that the * hint can only be updated under queue_lock as otherwise @blkg diff --git a/block/blk-core.c b/block/blk-core.c index 8b9e5dc882f4..b6fa6a9f7daa 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -687,6 +687,9 @@ EXPORT_SYMBOL_GPL(blk_queue_bypass_end); void blk_set_queue_dying(struct request_queue *q) { +#ifdef CONFIG_PROVE_LOCKING + q->cleanup_queue_task = current; +#endif blk_queue_flag_set(QUEUE_FLAG_DYING, q); /* @@ -907,6 +910,25 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask) } EXPORT_SYMBOL(blk_alloc_queue); +#ifdef CONFIG_PROVE_LOCKING +/** + * blk_entered_queue() - whether or not it is safe to access cgroup information + * @q: request queue pointer + * + * In order to avoid races between accessing cgroup information and the cgroup + * information removal from inside blk_cleanup_queue(), any code that accesses + * cgroup information must either be protected by blk_queue_enter() and/or + * blk_queue_enter_live() or must be called after the queue has been marked + * dying from the same task that called blk_cleanup_queue(). + */ +bool blk_entered_queue(struct request_queue *q) +{ + return (blk_queue_dying(q) && current == q->cleanup_queue_task) || + percpu_ref_read(>q_usage_counter) > 0; +} +EXPORT_SYMBOL(blk_entered_queue); +#endif + /** * blk_queue_enter() - try to increase q->q_usage_counter * @q: request queue pointer @@ -2254,6 +2276,8 @@ generic_make_request_checks(struct bio *bio) goto end_io; } + WARN_ON_ONCE(!blk_entered_queue(q)); + /* * For a REQ_NOWAIT based request, return -EOPNOTSUPP * if queue is not a request based queue. 
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h index 6c666fd7de3c..3b8512c259aa 100644 --- a/include/linux/blk-cgroup.h +++ b/include/linux/blk-cgroup.h @@ -266,6 +266,8 @@ static inline struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg, { struct blkcg_gq *blkg; + WARN_ON_ONCE(!blk_entered_queue(q)); + if (blkcg == _root) return q->root_blkg; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 780e4ea80d4d..0ed23677c36f 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -649,6 +649,9 @@ struct request_queue { int bypass_depth; atomic_tmq_freeze_depth; +#ifdef CONFIG_PROVE_LOCKING + struct task_struct *cleanup_queue_task; +#endif #if defined(CONFIG_BLK_DEV_BSG) bsg_job_fn *bsg_job_fn; @@ -1000,6 +1003,14 @@ extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t, extern int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags); extern void blk_queue_exit(struct request_queue *q); +#ifdef CONFIG_PROVE_LOCKING +extern bool blk_entered_queue(struct request_queue *q); +#else +static inline bool blk_entered_queue(struct request_queue *q) +{ + return true; +} +#endif extern void blk_start_queue(struct request_queue *q); extern void blk_start_queue_async(struct request_queue *q); extern void blk_stop_queue(struct request_queue *q); diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h index 009cdf3d65b6..5707289ba828 100644 --- a/include/linux/percpu-refcount.h +++ b/include/linux/percpu-refcount.h @@ -331,4 +331,6 @@ static inline bool percpu_ref_is_zero(struct percpu_ref *ref) return !atomic_long_read(>count); } +unsigned long percpu_ref_read(struct percpu_ref *ref); + #endif diff --git
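For callers, the contract the new WARN_ON_ONCE() enforces looks like the following. A hedged sketch (inspect_blkg is illustrative; blkg_lookup(), blk_queue_enter() and blk_queue_exit() are the real interfaces):

#include <linux/blk-cgroup.h>
#include <linux/blkdev.h>
#include <linux/rcupdate.h>

/* Any blkg_lookup() must sit inside a blk_queue_enter()/blk_queue_exit()
 * pair, or run from the task that marked the queue dying. */
static void inspect_blkg(struct blkcg *blkcg, struct request_queue *q)
{
	struct blkcg_gq *blkg;

	if (blk_queue_enter(q, 0))
		return;			/* queue is dying */

	rcu_read_lock();
	blkg = blkg_lookup(blkcg, q);	/* blk_entered_queue(q) holds here */
	if (blkg) {
		/* ... inspect blkg under RCU ... */
	}
	rcu_read_unlock();

	blk_queue_exit(q);
}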
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21, 2018 at 12:09:14PM -0400, Mike Snitzer wrote: > On Mon, May 21 2018 at 11:36am -0400, > Jens Axboewrote: > > > On 5/21/18 9:18 AM, Mike Snitzer wrote: > > > On Mon, May 21 2018 at 11:09am -0400, > > > Jens Axboe wrote: > > > > > >> On 5/21/18 9:04 AM, Mike Snitzer wrote: > > >>> On Mon, May 21 2018 at 10:52am -0400, > > >>> Jens Axboe wrote: > > >>> > ... > > IMHO you're making a big deal out of something that should not be. > > >>> > > >>> I raised an issue that had seemingly not been considered at all. Not > > >>> making a big deal. Raising it for others' benefit. > > >>> > > If the dm bits are that sensitive and cache line honed to perfection > > already due to previous regressions in that area, then it might > > not be a bad idea to have some compile checks for false cacheline > > sharing between sensitive members, or spilling of a sub-struct > > into multiple cachelines. > > > > It's not like this was pushed behind your back. It's posted for > > review. It's quite possible the net change is a win for dm. Let's > > focus on getting it reviewed, rather than pontificate on what > > could potentially go all wrong with this. > > >>> > > >>> Why are you making this personal? Or purely about DM? I'm merely > > >>> pointing out this change isn't something that can be given a quick > > >>> blanket "looks good". > > >> > > >> I'm not making this personal at all?! You raised a (valid) concern, > > >> I'm merely stating why I don't think it's a high risk issue. I'm > > >> assuming your worry is related to dm, as those are the reports > > >> that would ultimately land on your desk. > > > > > > Then we'll just agree to disagree with what this implies: "It's not like > > > this was pushed behind your back." > > > > I'm afraid you've lost me now - it was not pushed behind your back, > > it was posted for review, with you on the CC list. Not trying to > > be deliberately dense here, I just don't see what our disagreement is. > > You're having an off day ;) Mondays and all? > > I just raised an alignment concern that needs to be considered during > review by all stakeholders. Wasn't upset at all. Maybe my email tone > came off otherwise? > > And then you got confused by me pointing out how it was weird for you to > suggest I felt this stuff was pushed behind my back.. and went on to > think I really think that. It's almost like you're a confused hypnotist > seeding me with paranoid dilutions. ;) Uh, you came across as upset and paranoid to me too. Chalk it up to email? :) I personally don't care, I have no horse in this race. This particular patch series wasn't my idea, Christoph wanted all these conversions done so bioset_create() could be deleted. If you want us to hold off on the dm patch for awhile until someone can get around to testing it or whatever (since I don't have tests for everything I pushed) that would be perfectly fine by me.
[PATCH v12 2/2] blk-mq: Rework blk-mq timeout handling again
Recently the blk-mq timeout handling code was reworked. See also Tejun Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html). This patch reworks the blk-mq timeout handling code again. The timeout handling code is simplified by introducing a state machine per request. This change avoids that the blk-mq timeout handling code ignores completions that occur after blk_mq_check_expired() has been called and before blk_mq_rq_timed_out() has been called. Fix this race as follows: - Replace the __deadline member of struct request by a new member called das that contains the generation number, state and deadline. Only 32 bits are used for the deadline field such that all three fields occupy only 64 bits. This change reduces the maximum supported request timeout value from (2**63/HZ) to (2**31/HZ). - Remove all request member variables that became superfluous due to this change: gstate, gstate_seq and aborted_gstate_sync. - Remove the request state information that became superfluous due to this patch, namely RQF_MQ_TIMEOUT_EXPIRED. - Remove the code that became superfluous due to this change, namely the RCU lock and unlock statements in blk_mq_complete_request() and also the synchronize_rcu() call in the timeout handler. Notes: - A spinlock is used to protect atomic changes of rq->das on those architectures that do not provide a cmpxchg64() implementation. - Atomic instructions are only used to update the request state if a concurrent request state change could be in progress. - blk_add_timer() has been split into two functions - one for the legacy block layer and one for blk-mq. Signed-off-by: Bart Van AsscheCc: Tejun Heo Cc: Christoph Hellwig Cc: Jianchao Wang Cc: Ming Lei Cc: Sebastian Ott Cc: Sagi Grimberg Cc: Israel Rukshin , Cc: Max Gurtovoy --- block/blk-core.c | 6 -- block/blk-mq-debugfs.c | 1 - block/blk-mq.c | 238 ++--- block/blk-mq.h | 64 ++--- block/blk-timeout.c| 133 ++- block/blk.h| 11 +-- include/linux/blkdev.h | 48 +- 7 files changed, 263 insertions(+), 238 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 341501c5e239..a7301fcda734 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq) rq->internal_tag = -1; rq->start_time_ns = ktime_get_ns(); rq->part = NULL; - seqcount_init(>gstate_seq); - u64_stats_init(>aborted_gstate_sync); - /* -* See comment of blk_mq_init_request -*/ - WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC); } EXPORT_SYMBOL(blk_rq_init); diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index 3080e18cb859..ffa622366922 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -344,7 +344,6 @@ static const char *const rqf_name[] = { RQF_NAME(STATS), RQF_NAME(SPECIAL_PAYLOAD), RQF_NAME(ZONE_WRITE_LOCKED), - RQF_NAME(MQ_TIMEOUT_EXPIRED), RQF_NAME(MQ_POLL_SLEPT), }; #undef RQF_NAME diff --git a/block/blk-mq.c b/block/blk-mq.c index 64630caaf27e..af9375ffa4b1 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -318,7 +318,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, rq->special = NULL; /* tag was already set */ rq->extra_len = 0; - rq->__deadline = 0; + WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE); INIT_LIST_HEAD(>timeout_list); rq->timeout = 0; @@ -465,6 +465,39 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, } EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx); +/** + * blk_mq_change_rq_state - atomically test and set 
request state + * @rq: Request pointer. + * @old_state: Old request state. + * @new_state: New request state. + * + * Returns %true if and only if the old state was @old and if the state has + * been changed into @new. + */ +static bool blk_mq_change_rq_state(struct request *rq, + enum mq_rq_state old_state, + enum mq_rq_state new_state) +{ + union blk_deadline_and_state das = READ_ONCE(rq->das); + union blk_deadline_and_state old_val = das; + union blk_deadline_and_state new_val = das; + + WARN_ON_ONCE(new_state == MQ_RQ_IN_FLIGHT); + + old_val.state = old_state; + new_val.state = new_state; + /* +* For transitions from state in-flight to another state cmpxchg64() +* must be used. For other state transitions it is safe to use +* WRITE_ONCE(). +*/ + if (old_state
[PATCH v12 1/2] ARM: cmpxchg64: Only define cmpxchg64() if not using the Thumb instruction set
Since the implementation of cmpxchg64() uses instructions that are not Thumb instructions, only define cmpxchg64() if not building in Thumb mode. This patch allows the next patch in this series to check for cmpxchg64() support using #if defined(cmpxchg64). Signed-off-by: Bart Van Assche Cc: Catalin Marinas Cc: Will Deacon Cc: Russell King --- arch/arm/include/asm/cmpxchg.h | 4 1 file changed, 4 insertions(+) diff --git a/arch/arm/include/asm/cmpxchg.h b/arch/arm/include/asm/cmpxchg.h index 8b701f8e175c..41ab99bc04ca 100644 --- a/arch/arm/include/asm/cmpxchg.h +++ b/arch/arm/include/asm/cmpxchg.h @@ -141,7 +141,9 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size sizeof(*(ptr)));\ }) +#ifndef CONFIG_THUMB2_KERNEL #define cmpxchg64_local(ptr, o, n) __cmpxchg64_local_generic((ptr), (o), (n)) +#endif /* CONFIG_THUMB2_KERNEL */ #include @@ -241,6 +243,7 @@ static inline unsigned long __cmpxchg_local(volatile void *ptr, sizeof(*(ptr)));\ }) +#ifndef CONFIG_THUMB2_KERNEL static inline unsigned long long __cmpxchg64(unsigned long long *ptr, unsigned long long old, unsigned long long new) @@ -273,6 +276,7 @@ static inline unsigned long long __cmpxchg64(unsigned long long *ptr, }) #define cmpxchg64_local(ptr, o, n) cmpxchg64_relaxed((ptr), (o), (n)) +#endif /* CONFIG_THUMB2_KERNEL */ #endif /* __LINUX_ARM_ARCH__ >= 6 */ -- 2.16.3
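The consumer pattern this enables is a plain preprocessor test, with the spinlock fallback the blk-mq patch provides as das_lock. A sketch under those assumptions (change_packed_word is an illustrative name):

#include <linux/spinlock.h>
#include <linux/types.h>

/* Atomically swing a packed 64-bit word, falling back to a spinlock
 * where cmpxchg64() is unavailable (e.g. a Thumb-2 kernel). */
static bool change_packed_word(u64 *word, u64 old_val, u64 new_val,
			       spinlock_t *das_lock)
{
#if defined(cmpxchg64)
	return cmpxchg64(word, old_val, new_val) == old_val;
#else
	unsigned long flags;
	bool done = false;

	spin_lock_irqsave(das_lock, flags);
	if (*word == old_val) {
		*word = new_val;
		done = true;
	}
	spin_unlock_irqrestore(das_lock, flags);
	return done;
#endif
}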
[PATCH v12 0/2] blk-mq: Rework blk-mq timeout handling again
Hello Jens, This patch series reworks blk-mq timeout handling by introducing a state machine per request. Please consider this patch series for inclusion in the upstream kernel. Bart. Changes compared to v11: - Reworked patch 1/2: instead of introducing CONFIG_ARCH_HAVE_CMPXCHG64, make sure that cmpxchg64() is only defined if it can be used. Changes compared to v10: - In patch 1/2, added "default y if 64BIT" to the "config ARCH_HAVE_CMPXCHG64" entry in arch/Kconfig. Left out the "select ARCH_HAVE_CMPXCHG64" statements that became superfluous due to this change (alpha, arm64, powerpc and s390). - Also in patch 1/2, only select ARCH_HAVE_CMPXCHG64 if X86_CMPXCHG64 has been selected. - In patch 2/2, moved blk_mq_change_rq_state() from blk-mq.h to blk-mq.c. - Added a comment header above __blk_mq_requeue_request() and blk_mq_requeue_request(). - Documented the MQ_RQ_* state transitions in block/blk-mq.h. - Left out the fourth argument of blk_mq_rq_set_deadline(). Changes compared to v9: - Addressed multiple comments related to patch 1/2: added CONFIG_ARCH_HAVE_CMPXCHG64 for riscv, modified features/locking/cmpxchg64/arch-support.txt as requested and made the order of the symbols in the arch/*/Kconfig alphabetical where possible. Changes compared to v8: - Split into two patches. - Moved the spin_lock_init() call from blk_mq_rq_ctx_init() into blk_mq_init_request(). - Fixed the deadline set by blk_add_timer(). - Surrounded the das_lock member with #ifndef CONFIG_ARCH_HAVE_CMPXCHG64 / #endif. Changes compared to v7: - Fixed the generation number mechanism. Note: with this patch applied the behavior of the block layer does not depend on the generation number. - Added more 32-bit architectures to the list of architectures on which cmpxchg64() should not be used. Changes compared to v6: - Used a union instead of bit manipulations to store multiple values into a single 64-bit field. - Reduced the size of the timeout field from 64 to 32 bits. - Made sure that the block layer still builds with this patch applied for the sh and mips architectures. - Fixed two sparse warnings that were introduced by this patch in the WRITE_ONCE() calls. Changes compared to v5: - Restored the synchronize_rcu() call between marking a request for timeout handling and the actual timeout handling to avoid that timeout handling starts while .queue_rq() is still in progress if the timeout is very short. - Only use cmpxchg() if another context could attempt to change the request state concurrently. Use WRITE_ONCE() otherwise. Changes compared to v4: - Addressed multiple review comments from Christoph. The most important are that atomic_long_cmpxchg() has been changed into cmpxchg() and also that there is now a nice and clean split between the legacy and blk-mq versions of blk_add_timer(). - Changed the patch name and modified the patch description because there is disagreement about whether or not the v4.16 blk-mq core can complete a single request twice. Kept the "Cc: stable" tag because of https://bugzilla.kernel.org/show_bug.cgi?id=199077. Changes compared to v3 (see also https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html): - Removed the spinlock again that was introduced to protect the request state. v4 uses atomic_long_cmpxchg() instead. - Split __deadline into two variables - one for the legacy block layer and one for blk-mq. Changes compared to v2 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html): - Rebased and retested on top of kernel v4.16. 
Changes compared to v1 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html): - Removed the gstate and aborted_gstate members of struct request and used the __deadline member to encode both the generation and state information. Bart Van Assche (2): ARM: cmpxchg64: Only define cmpxchg64() if not using the Thumb instruction set blk-mq: Rework blk-mq timeout handling again arch/arm/include/asm/cmpxchg.h | 4 + block/blk-core.c | 6 -- block/blk-mq-debugfs.c | 1 - block/blk-mq.c | 238 +++-- block/blk-mq.h | 64 +-- block/blk-timeout.c| 133 +++ block/blk.h| 11 +- include/linux/blkdev.h | 48 + 8 files changed, 267 insertions(+), 238 deletions(-) -- 2.16.3
Re: [PATCH 1/6] nvme: Sync request queues on reset
On Tue, May 22, 2018 at 12:08:37AM +0800, Ming Lei wrote: > Please take a look at blk_mq_complete_request(). Even with Bart's > change, the request still won't be completed by driver. The request can > only be completed by either driver or blk-mq, not both. So you're saying blk-mq can't complete a request the driver returned to blk-mq to complete. And that's the nvme driver's problem to fix?
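The invariant behind this exchange can be stated compactly: a request is completed exactly once, so the normal completion path and the timeout path must both try to win the same state transition, and only one can. A conceptual sketch, not the blk-mq implementation:

#include <linux/atomic.h>

enum demo_rq_state { DEMO_RQ_IN_FLIGHT, DEMO_RQ_COMPLETE };

/* Only the caller that observes IN_FLIGHT may complete the request;
 * the loser of the cmpxchg must back off. */
static bool demo_claim_completion(atomic_t *state)
{
	return atomic_cmpxchg(state, DEMO_RQ_IN_FLIGHT,
			      DEMO_RQ_COMPLETE) == DEMO_RQ_IN_FLIGHT;
}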
Re: [PATCH 3/6] nvme: Move all IO out of controller reset
On Tue, May 22, 2018 at 12:04:53AM +0800, Ming Lei wrote: > On Mon, May 21, 2018 at 09:44:33AM -0600, Keith Busch wrote: > > On Mon, May 21, 2018 at 11:34:27PM +0800, Ming Lei wrote: > > > nvme_dev_disable() quiesces queues first before killing queues. > > > > > > If queues are quiesced during or before nvme_wait_freeze() is run > > > from the 2nd part of reset, the 2nd part can't move on, and IO hang > > > is caused. Finally no reset can be scheduled at all. > > > > But this patch moves nvme_wait_freeze outside the reset path, so I'm > > afraid I'm unable to follow how you've concluded the wait freeze is > > somehow part of the reset. > > For example: > > 1) the 1st timeout event: > > - nvme_dev_disable() > - reset > - scan_work > > 2) the 2nd timeout event: > > nvme_dev_disable() may come just after nvme_start_queues() in > the above reset of the 1st timeout. And nvme_timeout() won't > schedule a new reset since the controller state is NVME_CTRL_CONNECTING. Let me get this straight -- you're saying nvme_start_queues is going to somehow immediately trigger timeout work? I can't see how that could possibly happen in real life, but we can just remove it and use the existing nvme_start_ctrl to handle that in the LIVE state.
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On 5/21/18 10:09 AM, Mike Snitzer wrote: > On Mon, May 21 2018 at 11:36am -0400, > Jens Axboewrote: > >> On 5/21/18 9:18 AM, Mike Snitzer wrote: >>> On Mon, May 21 2018 at 11:09am -0400, >>> Jens Axboe wrote: >>> On 5/21/18 9:04 AM, Mike Snitzer wrote: > On Mon, May 21 2018 at 10:52am -0400, > Jens Axboe wrote: > > ... >> IMHO you're making a big deal out of something that should not be. > > I raised an issue that had seemingly not been considered at all. Not > making a big deal. Raising it for others' benefit. > >> If the dm bits are that sensitive and cache line honed to perfection >> already due to previous regressions in that area, then it might >> not be a bad idea to have some compile checks for false cacheline >> sharing between sensitive members, or spilling of a sub-struct >> into multiple cachelines. >> >> It's not like this was pushed behind your back. It's posted for >> review. It's quite possible the net change is a win for dm. Let's >> focus on getting it reviewed, rather than pontificate on what >> could potentially go all wrong with this. > > Why are you making this personal? Or purely about DM? I'm merely > pointing out this change isn't something that can be given a quick > blanket "looks good". I'm not making this personal at all?! You raised a (valid) concern, I'm merely stating why I don't think it's a high risk issue. I'm assuming your worry is related to dm, as those are the reports that would ultimately land on your desk. >>> >>> Then we'll just agree to disagree with what this implies: "It's not like >>> this was pushed behind your back." >> >> I'm afraid you've lost me now - it was not pushed behind your back, >> it was posted for review, with you on the CC list. Not trying to >> be deliberately dense here, I just don't see what our disagreement is. > > You're having an off day ;) Mondays and all? > > I just raised an alignment concern that needs to be considered during > review by all stakeholders. Wasn't upset at all. Maybe my email tone > came off otherwise? > > And then you got confused by me pointing out how it was weird for you to > suggest I felt this stuff was pushed behind my back.. and went on to > think I really think that. It's almost like you're a confused hypnotist > seeding me with paranoid dilutions. ;) > > /me waits for Jens to snap his fingers so he can just slowly step away > from this line of discussion that is solidly dead now... Mike, wtf are you talking about. >>> Reality is I'm fine with the change. Just think there is follow-on work >>> (now or later) that is needed. >> >> It's not hard to run the quick struct layout checks or alignment. If >> there's a concern, that should be done now, instead of being deferred to >> later. There's no point merging something that we expect to have >> follow-on work. If that's the case, then it should not be merged. There >> are no dependencies in the patchset, except that the last patch >> obviously can't be applied until all of the previous ones are in. > > Cool, sounds good. > >> I took a quick look at the struct mapped_device layout, which I'm >> assuming is the most important on the dm side. It's pretty big to >> begin with, obviously this makes it bigger since we're now >> embedding the bio_sets. On my config, doesn't show any false sharing >> that would be problematic, or layout changes that would worry me. FWIW. > > Great, thanks, do you happen to have a tree you can push so others can > get a quick compile and look at the series fully applied? 
I do not, I simply ran a quick analysis of the layout, then applied the full patchset, and repeated that exercise. Literally a 5 minute thing. I haven't applied the series, so haven't pushed anything out. > BTW, I'm upstream DM maintainer but I have a role in caring for related > subsystems (e.g. block core, etc) in downstream releases.. *cough* RHEL. > So this alignment concern wasn't ever purely about DM. I tend to care about that too... -- Jens Axboe
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21 2018 at 11:36am -0400, Jens Axboewrote: > On 5/21/18 9:18 AM, Mike Snitzer wrote: > > On Mon, May 21 2018 at 11:09am -0400, > > Jens Axboe wrote: > > > >> On 5/21/18 9:04 AM, Mike Snitzer wrote: > >>> On Mon, May 21 2018 at 10:52am -0400, > >>> Jens Axboe wrote: > >>> ... > IMHO you're making a big deal out of something that should not be. > >>> > >>> I raised an issue that had seemingly not been considered at all. Not > >>> making a big deal. Raising it for others' benefit. > >>> > If the dm bits are that sensitive and cache line honed to perfection > already due to previous regressions in that area, then it might > not be a bad idea to have some compile checks for false cacheline > sharing between sensitive members, or spilling of a sub-struct > into multiple cachelines. > > It's not like this was pushed behind your back. It's posted for > review. It's quite possible the net change is a win for dm. Let's > focus on getting it reviewed, rather than pontificate on what > could potentially go all wrong with this. > >>> > >>> Why are you making this personal? Or purely about DM? I'm merely > >>> pointing out this change isn't something that can be given a quick > >>> blanket "looks good". > >> > >> I'm not making this personal at all?! You raised a (valid) concern, > >> I'm merely stating why I don't think it's a high risk issue. I'm > >> assuming your worry is related to dm, as those are the reports > >> that would ultimately land on your desk. > > > > Then we'll just agree to disagree with what this implies: "It's not like > > this was pushed behind your back." > > I'm afraid you've lost me now - it was not pushed behind your back, > it was posted for review, with you on the CC list. Not trying to > be deliberately dense here, I just don't see what our disagreement is. You're having an off day ;) Mondays and all? I just raised an alignment concern that needs to be considered during review by all stakeholders. Wasn't upset at all. Maybe my email tone came off otherwise? And then you got confused by me pointing out how it was weird for you to suggest I felt this stuff was pushed behind my back.. and went on to think I really think that. It's almost like you're a confused hypnotist seeding me with paranoid dilutions. ;) /me waits for Jens to snap his fingers so he can just slowly step away from this line of discussion that is solidly dead now... > > Reality is I'm fine with the change. Just think there is follow-on work > > (now or later) that is needed. > > It's not hard to run the quick struct layout checks or alignment. If > there's a concern, that should be done now, instead of being deferred to > later. There's no point merging something that we expect to have > follow-on work. If that's the case, then it should not be merged. There > are no dependencies in the patchset, except that the last patch > obviously can't be applied until all of the previous ones are in. Cool, sounds good. > I took a quick look at the struct mapped_device layout, which I'm > assuming is the most important on the dm side. It's pretty big to > begin with, obviously this makes it bigger since we're now > embedding the bio_sets. On my config, doesn't show any false sharing > that would be problematic, or layout changes that would worry me. FWIW. Great, thanks, do you happen to have a tree you can push so others can get a quick compile and look at the series fully applied? BTW, I'm upstream DM maintainer but I have a role in caring for related subsystems (e.g. block core, etc) in downstream releases.. 
*cough* RHEL. So this alignment concern wasn't ever purely about DM. Mike
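The compile checks for false cacheline sharing suggested earlier in this thread are easy to prototype. A hypothetical guard, with made-up struct and member names standing in for whatever a subsystem considers sensitive (BUILD_BUG_ON() needs function context, e.g. a module init):

#include <linux/atomic.h>
#include <linux/build_bug.h>
#include <linux/cache.h>
#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/stddef.h>

/* Fail the build if two hot members land on the same cacheline. */
#define ASSERT_SEPARATE_CACHELINES(type, a, b)				\
	BUILD_BUG_ON(offsetof(type, a) / SMP_CACHE_BYTES ==		\
		     offsetof(type, b) / SMP_CACHE_BYTES)

struct demo_dev {
	spinlock_t lock;				/* hot on submission */
	atomic_t in_flight ____cacheline_aligned;	/* hot on completion */
};

static int __init demo_init(void)
{
	/* Evaluated entirely at compile time; no runtime cost. */
	ASSERT_SEPARATE_CACHELINES(struct demo_dev, lock, in_flight);
	return 0;
}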
Re: [PATCH 1/6] nvme: Sync request queues on reset
On Mon, May 21, 2018 at 09:59:09AM -0600, Keith Busch wrote: > On Mon, May 21, 2018 at 11:25:43PM +0800, Ming Lei wrote: > > On Mon, May 21, 2018 at 08:04:13AM -0600, Keith Busch wrote: > > > On Sat, May 19, 2018 at 08:01:42AM +0800, Ming Lei wrote: > > > > > You keep saying that, but the controller state is global to the > > > > > controller. It doesn't matter which namespace request_queue started > > > > > the > > > > > reset: every namespaces request queue sees the RESETTING controller > > > > > state > > > > > > > > When timeouts come, the global state of RESETTING may not be updated > > > > yet, so all the timeouts may not observe the state. > > > > > > Even prior to the RESETING state, every single command, no matter > > > which namespace or request_queue it came on, is reclaimed by the driver. > > > There *should* be no requests to timeout after nvme_dev_disable is called > > > because the nvme driver returned control of all requests in the tagset > > > to blk-mq. > > > > The timed-out requests won't be canceled by nvme_dev_disable(). > > ??? nvme_dev_disable cancels all started requests. There are no exceptions. Please take a look at blk_mq_complete_request(). Even with Bart's change, the request still won't be completed by driver. The request can only be completed by either driver or blk-mq, not both. > > > If the timed-out requests is handled as RESET_TIMER, there may > > be new timeout event triggered again. > > > > > > > > In any case, if blk-mq decides it won't complete those requests, we > > > can just swap the order in the reset_work: sync first, uncondintionally > > > disable. Does the following snippet look more okay? > > > > > > --- > > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > > > index 17a0190bd88f..42af077ee07a 100644 > > > --- a/drivers/nvme/host/pci.c > > > +++ b/drivers/nvme/host/pci.c > > > @@ -2307,11 +2307,14 @@ static void nvme_reset_work(struct work_struct > > > *work) > > > goto out; > > > > > > /* > > > - * If we're called to reset a live controller first shut it down before > > > - * moving on. > > > + * Ensure there are no timeout work in progress prior to forcefully > > > + * disabling the queue. There is no harm in disabling the device even > > > + * when it was already disabled, as this will forcefully reclaim any > > > + * IOs that are stuck due to blk-mq's timeout handling that prevents > > > + * timed out requests from completing. > > >*/ > > > - if (dev->ctrl.ctrl_config & NVME_CC_ENABLE) > > > - nvme_dev_disable(dev, false); > > > + nvme_sync_queues(>ctrl); > > > + nvme_dev_disable(dev, false); > > > > That may not work reliably too. > > > > For example, request A from NS_0 is timed-out and handled as RESET_TIMER, > > meantime request B from NS_1 is timed-out and handled as EH_HANDLED. > > Meanwhile, request B's nvme_dev_disable prior to returning EH_HANDLED > cancels all requests, which includes request A. No, please see my above comment. Thanks, Ming
Re: [PATCH 3/6] nvme: Move all IO out of controller reset
On Mon, May 21, 2018 at 09:44:33AM -0600, Keith Busch wrote: > On Mon, May 21, 2018 at 11:34:27PM +0800, Ming Lei wrote: > > nvme_dev_disable() quiesces queues first before killing queues. > > > > If queues are quiesced during or before nvme_wait_freeze() is run > > from the 2nd part of reset, the 2nd part can't move on, and IO hang > > is caused. Finally no reset can be scheduled at all. > > But this patch moves nvme_wait_freeze outside the reset path, so I'm > afraid I'm unable to follow how you've concluded the wait freeze is > somehow part of the reset. For example: 1) the 1st timeout event: - nvme_dev_disable() - reset - scan_work 2) the 2nd timeout event: nvme_dev_disable() may come just after nvme_start_queues() in the above reset of the 1st timeout. And nvme_timeout() won't schedule a new reset since the controller state is NVME_CTRL_CONNECTING. Then scan_work in 1st timeout still may hang for ever. Thanks, Ming
[PATCH v5 4/5] fs: blkdev set bio prio from kiocb prio
From: Adam Manzanares  Now that kiocb has an ioprio field, copy this over to the bio when it is created from the kiocb. Signed-off-by: Adam Manzanares --- fs/block_dev.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/block_dev.c b/fs/block_dev.c index 7ec920e27065..da1e94d2bb75 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -355,6 +355,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) bio->bi_write_hint = iocb->ki_hint; bio->bi_private = dio; bio->bi_end_io = blkdev_bio_end_io; + bio->bi_ioprio = iocb->ki_ioprio; ret = bio_iov_iter_get_pages(bio, iter); if (unlikely(ret)) { -- 2.15.1
[PATCH v5 3/5] fs: Add aio iopriority support
From: Adam Manzanares  This is the per-I/O equivalent of the ioprio_set system call. When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field. We set the blkdev bio iopriority unconditionally, so we need to guarantee the kiocb is initialized properly. Added changes to the loopback driver and init_sync_kiocb to achieve this. This patch depends on block: add ioprio_check_cap function. Signed-off-by: Adam Manzanares --- drivers/block/loop.c | 3 +++ fs/aio.c | 16 include/linux/fs.h | 3 +++ include/uapi/linux/aio_abi.h | 1 + 4 files changed, 23 insertions(+) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 5d4e31655d96..dd98dfd97f5e 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -76,6 +76,8 @@ #include #include #include +#include + #include "loop.h" #include @@ -559,6 +561,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, cmd->iocb.ki_filp = file; cmd->iocb.ki_complete = lo_rw_aio_complete; cmd->iocb.ki_flags = IOCB_DIRECT; + cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0); if (cmd->css) kthread_associate_blkcg(cmd->css); diff --git a/fs/aio.c b/fs/aio.c index f3eae5d5771b..44b4572be524 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1451,6 +1451,22 @@ static int aio_prep_rw(struct kiocb *req, struct iocb *iocb) if (iocb->aio_flags & IOCB_FLAG_RESFD) req->ki_flags |= IOCB_EVENTFD; req->ki_hint = file_write_hint(req->ki_filp); + if (iocb->aio_flags & IOCB_FLAG_IOPRIO) { + /* +* If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then +* aio_reqprio is interpreted as an I/O scheduling +* class and priority. +*/ + ret = ioprio_check_cap(iocb->aio_reqprio); + if (ret) { + pr_debug("aio ioprio check cap error\n"); + return -EINVAL; + } + + req->ki_ioprio = iocb->aio_reqprio; + } else + req->ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0); + ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags); if (unlikely(ret)) fput(req->ki_filp); diff --git a/include/linux/fs.h b/include/linux/fs.h index 50de40dbbb85..73b749ed3ea1 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -36,6 +36,7 @@ #include #include #include +#include #include #include @@ -302,6 +303,7 @@ struct kiocb { void *private; int ki_flags; u16 ki_hint; + u16 ki_ioprio; /* See linux/ioprio.h */ } __randomize_layout; static inline bool is_sync_kiocb(struct kiocb *kiocb) @@ -1942,6 +1944,7 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp) .ki_filp = filp, .ki_flags = iocb_flags(filp), .ki_hint = ki_hint_validate(file_write_hint(filp)), + .ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0), }; } diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h index 2c0a3415beee..d4e768d55d14 100644 --- a/include/uapi/linux/aio_abi.h +++ b/include/uapi/linux/aio_abi.h @@ -55,6 +55,7 @@ enum { * is valid. */ #define IOCB_FLAG_RESFD (1 << 0) +#define IOCB_FLAG_IOPRIO (1 << 1) /* read() from /dev/aio returns these structures. */ struct io_event { -- 2.15.1
[PATCH v5 5/5] fs: iomap dio set bio prio from kiocb prio
From: Adam Manzanares  Now that kiocb has an ioprio field, copy this over to the bio when it is created from the kiocb during direct IO. Signed-off-by: Adam Manzanares --- fs/iomap.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/iomap.c b/fs/iomap.c index afd163586aa0..65aae194aeca 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -919,6 +919,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length, bio->bi_iter.bi_sector = (iomap->addr + pos - iomap->offset) >> 9; bio->bi_write_hint = dio->iocb->ki_hint; + bio->bi_ioprio = dio->iocb->ki_ioprio; bio->bi_private = dio; bio->bi_end_io = iomap_dio_bio_end_io; -- 2.15.1
[PATCH v5 1/5] block: add ioprio_check_cap function
From: Adam Manzanares  Aio per command iopriority support introduces a second interface between userland and the kernel capable of passing iopriority. The aio interface also needs the ability to verify that the submitting context has sufficient privileges to submit IOPRIO_RT commands. This patch creates the ioprio_check_cap function to be used by the ioprio_set system call and also by the aio interface. Signed-off-by: Adam Manzanares Reviewed-by: Christoph Hellwig --- block/ioprio.c | 22 -- include/linux/ioprio.h | 2 ++ 2 files changed, 18 insertions(+), 6 deletions(-) diff --git a/block/ioprio.c b/block/ioprio.c index 6f5d0b6625e3..f9821080c92c 100644 --- a/block/ioprio.c +++ b/block/ioprio.c @@ -61,15 +61,10 @@ int set_task_ioprio(struct task_struct *task, int ioprio) } EXPORT_SYMBOL_GPL(set_task_ioprio); -SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio) +int ioprio_check_cap(int ioprio) { int class = IOPRIO_PRIO_CLASS(ioprio); int data = IOPRIO_PRIO_DATA(ioprio); - struct task_struct *p, *g; - struct user_struct *user; - struct pid *pgrp; - kuid_t uid; - int ret; switch (class) { case IOPRIO_CLASS_RT: @@ -92,6 +87,21 @@ SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio) return -EINVAL; } + return 0; +} + +SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio) +{ + struct task_struct *p, *g; + struct user_struct *user; + struct pid *pgrp; + kuid_t uid; + int ret; + + ret = ioprio_check_cap(ioprio); + if (ret) + return ret; + ret = -ESRCH; rcu_read_lock(); switch (which) { diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h index 627efac73e6d..4a28cec49ec3 100644 --- a/include/linux/ioprio.h +++ b/include/linux/ioprio.h @@ -77,4 +77,6 @@ extern int ioprio_best(unsigned short aprio, unsigned short bprio); extern int set_task_ioprio(struct task_struct *task, int ioprio); +extern int ioprio_check_cap(int ioprio); + #endif -- 2.15.1
[PATCH v5 2/5] fs: Convert kiocb rw_hint from enum to u16
From: Adam Manzanares  In order to avoid kiocb bloat for per command iopriority support, rw_hint is converted from enum to a u16. Added a guard around ki_hint assignment. Signed-off-by: Adam Manzanares --- include/linux/fs.h | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 7f07977bdfd7..50de40dbbb85 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -284,6 +284,8 @@ enum rw_hint { WRITE_LIFE_EXTREME = RWH_WRITE_LIFE_EXTREME, }; +#define MAX_KI_HINT ((1 << 16) - 1) /* ki_hint type is u16 */ + #define IOCB_EVENTFD (1 << 0) #define IOCB_APPEND (1 << 1) #define IOCB_DIRECT (1 << 2) @@ -299,7 +301,7 @@ struct kiocb { void (*ki_complete)(struct kiocb *iocb, long ret, long ret2); void *private; int ki_flags; - enum rw_hint ki_hint; + u16 ki_hint; } __randomize_layout; static inline bool is_sync_kiocb(struct kiocb *kiocb) @@ -1927,12 +1929,19 @@ static inline enum rw_hint file_write_hint(struct file *file) static inline int iocb_flags(struct file *file); +static inline u16 ki_hint_validate(enum rw_hint hint) +{ + if (hint > MAX_KI_HINT) + return 0; + return hint; +} + static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp) { *kiocb = (struct kiocb) { .ki_filp = filp, .ki_flags = iocb_flags(filp), - .ki_hint = file_write_hint(filp), + .ki_hint = ki_hint_validate(file_write_hint(filp)), }; } -- 2.15.1
[PATCH v5 0/5] AIO add per-command iopriority
From: Adam Manzanares  This is the per-I/O equivalent of the ioprio_set system call. See the following link for performance implications on a SATA HDD: https://lkml.org/lkml/2016/12/6/495 First patch factors ioprio_check_cap function out of ioprio_set system call to also be used by the aio ioprio interface. Second patch converts kiocb ki_hint field to a u16 to avoid kiocb bloat. Third patch passes ioprio hint from aio iocb to kiocb and initializes kiocb ioprio value appropriately when it is not explicitly set. Fourth patch enables the feature for blkdev. Fifth patch enables the feature for iomap direct IO. Note: this work is based on top of linux-vfs/for-next v2: merge patches use IOCB_FLAG_IOPRIO validate intended use with IOCB_IOPRIO add linux-api and linux-block to cc v3: add ioprio_check_cap function convert kiocb ki_hint to u16 use ioprio_check_cap when adding ioprio to kiocb in aio.c v4: handle IOCB_IOPRIO in aio_prep_rw note patch 3 depends on patch 1 in commit msg v5: rename ki_hint_valid -> ki_hint_validate remove ki_hint_validate comment and whitespace remove IOCB_IOPRIO flag initialize kiocb to have no priority Adam Manzanares (5): block: add ioprio_check_cap function fs: Convert kiocb rw_hint from enum to u16 fs: Add aio iopriority support fs: blkdev set bio prio from kiocb prio fs: iomap dio set bio prio from kiocb prio block/ioprio.c | 22 -- drivers/block/loop.c | 3 +++ fs/aio.c | 16 fs/block_dev.c | 1 + fs/iomap.c | 1 + include/linux/fs.h | 16 ++-- include/linux/ioprio.h | 2 ++ include/uapi/linux/aio_abi.h | 1 + 8 files changed, 54 insertions(+), 8 deletions(-) -- 2.15.1
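For completeness, a hedged userspace sketch of what the series enables: one read submitted with a per-I/O realtime priority. The IOCB_FLAG_IOPRIO value comes from patch 3, the IOPRIO_* constants mirror the kernel's non-UAPI linux/ioprio.h, and error handling is omitted:

#include <linux/aio_abi.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef IOCB_FLAG_IOPRIO
#define IOCB_FLAG_IOPRIO	(1 << 1)	/* from patch 3 of this series */
#endif

/* Mirrors (non-UAPI) include/linux/ioprio.h: */
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_CLASS_RT		1

static int io_setup(unsigned int nr, aio_context_t *ctx)
{
	return syscall(SYS_io_setup, nr, ctx);
}

static int io_submit(aio_context_t ctx, long nr, struct iocb **iocbpp)
{
	return syscall(SYS_io_submit, ctx, nr, iocbpp);
}

int submit_rt_read(int fd, void *buf, size_t len)
{
	aio_context_t ctx = 0;
	struct iocb cb;
	struct iocb *list[1] = { &cb };

	io_setup(1, &ctx);
	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_lio_opcode = IOCB_CMD_PREAD;
	cb.aio_buf = (unsigned long)buf;
	cb.aio_nbytes = len;
	cb.aio_flags = IOCB_FLAG_IOPRIO;
	cb.aio_reqprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0);
	return io_submit(ctx, 1, list);
}

Per patches 1 and 3, the IOPRIO_CLASS_RT request is rejected at submission (the aio path returns -EINVAL) unless the caller has CAP_SYS_ADMIN, matching the ioprio_set() checks.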
Re: [PATCH 1/6] nvme: Sync request queues on reset
On Mon, May 21, 2018 at 11:25:43PM +0800, Ming Lei wrote: > On Mon, May 21, 2018 at 08:04:13AM -0600, Keith Busch wrote: > > On Sat, May 19, 2018 at 08:01:42AM +0800, Ming Lei wrote: > > > > You keep saying that, but the controller state is global to the > > > > controller. It doesn't matter which namespace request_queue started the > > > > reset: every namespaces request queue sees the RESETTING controller > > > > state > > > > > > When timeouts come, the global state of RESETTING may not be updated > > > yet, so all the timeouts may not observe the state. > > > > Even prior to the RESETING state, every single command, no matter > > which namespace or request_queue it came on, is reclaimed by the driver. > > There *should* be no requests to timeout after nvme_dev_disable is called > > because the nvme driver returned control of all requests in the tagset > > to blk-mq. > > The timed-out requests won't be canceled by nvme_dev_disable(). ??? nvme_dev_disable cancels all started requests. There are no exceptions. > If the timed-out requests is handled as RESET_TIMER, there may > be new timeout event triggered again. > > > > > In any case, if blk-mq decides it won't complete those requests, we > > can just swap the order in the reset_work: sync first, uncondintionally > > disable. Does the following snippet look more okay? > > > > --- > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > > index 17a0190bd88f..42af077ee07a 100644 > > --- a/drivers/nvme/host/pci.c > > +++ b/drivers/nvme/host/pci.c > > @@ -2307,11 +2307,14 @@ static void nvme_reset_work(struct work_struct > > *work) > > goto out; > > > > /* > > -* If we're called to reset a live controller first shut it down before > > -* moving on. > > +* Ensure there are no timeout work in progress prior to forcefully > > +* disabling the queue. There is no harm in disabling the device even > > +* when it was already disabled, as this will forcefully reclaim any > > +* IOs that are stuck due to blk-mq's timeout handling that prevents > > +* timed out requests from completing. > > */ > > - if (dev->ctrl.ctrl_config & NVME_CC_ENABLE) > > - nvme_dev_disable(dev, false); > > + nvme_sync_queues(>ctrl); > > + nvme_dev_disable(dev, false); > > That may not work reliably too. > > For example, request A from NS_0 is timed-out and handled as RESET_TIMER, > meantime request B from NS_1 is timed-out and handled as EH_HANDLED. Meanwhile, request B's nvme_dev_disable prior to returning EH_HANDLED cancels all requests, which includes request A. > When the above reset work is run for handling timeout of req B, > new timeout event on request A may come just between the above > nvme_sync_queues() and nvme_dev_disable() The sync queues either stops the timer from running, or waits for it to complete. We are in the RESETTING state: if request A's timeout happens to be running, we're not restarting the timer; we're returning EH_HANDLED. > then nvme_dev_disable() > can't cover request A, and finally the timed-out event for req A > will nvme_dev_disable() when the current reset is just in-progress, > then this reset can't move on, and IO hang is caused. At no point in the nvme_reset are we waiting for any IO to complete. Reset continues to make forward progress.
Re: [PATCH 3/6] nvme: Move all IO out of controller reset
On Mon, May 21, 2018 at 11:34:27PM +0800, Ming Lei wrote: > nvme_dev_disable() quiesces queues first before killing queues. > > If queues are quiesced during or before nvme_wait_freeze() is run > from the 2nd part of reset, the 2nd part can't move on, and IO hang > is caused. Finally no reset can be scheduled at all. But this patch moves nvme_wait_freeze outside the reset path, so I'm afraid I'm unable to follow how you've concluded the wait freeze is somehow part of the reset.
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On 5/21/18 9:18 AM, Mike Snitzer wrote: > On Mon, May 21 2018 at 11:09am -0400, > Jens Axboe wrote: > >> On 5/21/18 9:04 AM, Mike Snitzer wrote: >>> On Mon, May 21 2018 at 10:52am -0400, >>> Jens Axboe wrote: >>> On 5/21/18 8:47 AM, Mike Snitzer wrote: > On Mon, May 21 2018 at 10:36am -0400, > Jens Axboe wrote: > >> On 5/21/18 8:31 AM, Mike Snitzer wrote: >>> On Mon, May 21 2018 at 10:19am -0400, >>> Jens Axboe wrote: >>> On 5/21/18 8:03 AM, Mike Snitzer wrote: > On Sun, May 20 2018 at 6:25pm -0400, > Kent Overstreet wrote: > >> Jens - this series does the rest of the conversions that Christoph >> wanted, and >> drops bioset_create(). >> >> Only lightly tested, but the changes are pretty mechanical. Based on >> your >> for-next tree. > > By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' > you've altered the alignment of members in data structures. So I'll > need to audit all the data structures you've modified in DM. > > Could we get the backstory on _why_ you're making this change? > Would go a long way to helping me appreciate why this is a good use of > anyone's time. Yeah, it's in the first series, it gets rid of a pointer indirection. >>> >>> "Allows mempools to be embedded in other structs, getting rid of a >>> pointer indirection from allocation fastpaths." >>> >>> So this is about using contiguous memory or avoiding partial allocation >>> failure? Or both? >>> >>> Or more to it? Just trying to fully appreciate the theory behind the >>> perceived associated benefit. >> >> It's about avoiding a pointer indirection. Instead of having to >> follow a pointer to get to that struct, it's simple offset math off >> your main structure. >> >>> I do think the increased risk of these embedded bio_set and mempool_t >>> themselves crossing cachelines, or struct members that follow them doing >>> so, really detracts from these types of changes. >> >> Definitely something to look out for, though most of them should be >> per-dev structures and not in-flight structures. That makes it a bit >> less sensitive. But can't hurt to audit the layouts and adjust if >> necessary. This is why it's posted for review :-) > > This isn't something that is easily caught upfront. Yes we can all be > busy little beavers with pahole to audit alignment. But chances are > most people won't do it. > > Reality is there is potential for a regression due to false sharing to > creep in if a hot struct member suddenly starts straddling a cacheline. > That type of NUMA performance killer is pretty insidious and somewhat > tedious to hunt down even when looking for it with specialized tools: > https://joemario.github.io/blog/2016/09/01/c2c-blog/ IMHO you're making a big deal out of something that should not be. >>> >>> I raised an issue that had seemingly not been considered at all. Not >>> making a big deal. Raising it for others' benefit. >>> If the dm bits are that sensitive and cache line honed to perfection already due to previous regressions in that area, then it might not be a bad idea to have some compile checks for false cacheline sharing between sensitive members, or spilling of a sub-struct into multiple cachelines. It's not like this was pushed behind your back. It's posted for review. It's quite possible the net change is a win for dm. Let's focus on getting it reviewed, rather than pontificate on what could potentially go all wrong with this. >>> >>> Why are you making this personal?
I'm merely >>> pointing out this change isn't something that can be given a quick >>> blanket "looks good". >> >> I'm not making this personal at all?! You raised a (valid) concern, >> I'm merely stating why I don't think it's a high risk issue. I'm >> assuming your worry is related to dm, as those are the reports >> that would ultimately land on your desk. > > Then we'll just agree to disagree with what this implies: "It's not like > this was pushed behind your back." I'm afraid you've lost me now - it was not pushed behind your back, it was posted for review, with you on the CC list. Not trying to be deliberately dense here, I just don't see what our disagreement is. > Reality is I'm fine with the change. Just think there is follow-on work > (now or later) that is needed. It's not hard to run the quick struct layout and alignment checks. If there's a concern, that should be done now, instead of being deferred to later. There's no point merging something that we expect to have follow-on work. If that's the
Re: [PATCH v11 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64
On Fri, 2018-05-18 at 11:32 -0700, h...@zytor.com wrote:
> On May 18, 2018 11:00:05 AM PDT, Bart Van Assche wrote:
> > The next patch in this series introduces a call to cmpxchg64()
> > in the block layer core for those architectures on which this
> > functionality is available. Make it possible to test whether
> > cmpxchg64() is available by introducing CONFIG_ARCH_HAVE_CMPXCHG64.
> >
> > Signed-off-by: Bart Van Assche
> > Cc: Catalin Marinas
> > Cc: Will Deacon
> > Cc: Tony Luck
> > Cc: Fenghua Yu
> > Cc: Geert Uytterhoeven
> > Cc: "James E.J. Bottomley"
> > Cc: Helge Deller
> > Cc: Benjamin Herrenschmidt
> > Cc: Paul Mackerras
> > Cc: Michael Ellerman
> > Cc: Martin Schwidefsky
> > Cc: Heiko Carstens
> > Cc: David S. Miller
> > Cc: Thomas Gleixner
> > Cc: Ingo Molnar
> > Cc: H. Peter Anvin
> > Cc: Chris Zankel
> > Cc: Max Filippov
> > Cc: Arnd Bergmann
> > Cc: Jonathan Corbet
> > ---
> >  .../features/locking/cmpxchg64/arch-support.txt | 33 ++
> >  arch/Kconfig        | 4 +++
> >  arch/arm/Kconfig    | 1 +
> >  arch/ia64/Kconfig   | 1 +
> >  arch/m68k/Kconfig   | 1 +
> >  arch/mips/Kconfig   | 1 +
> >  arch/parisc/Kconfig | 1 +
> >  arch/riscv/Kconfig  | 1 +
> >  arch/sparc/Kconfig  | 1 +
> >  arch/x86/Kconfig    | 1 +
> >  arch/xtensa/Kconfig | 1 +
> >  11 files changed, 46 insertions(+)
> >  create mode 100644 Documentation/features/locking/cmpxchg64/arch-support.txt
> >
> > diff --git a/Documentation/features/locking/cmpxchg64/arch-support.txt b/Documentation/features/locking/cmpxchg64/arch-support.txt
> > new file mode 100644
> > index ..84bfef7242b2
> > --- /dev/null
> > +++ b/Documentation/features/locking/cmpxchg64/arch-support.txt
> > @@ -0,0 +1,33 @@
> > +#
> > +# Feature name: cmpxchg64
> > +# Kconfig: ARCH_HAVE_CMPXCHG64
> > +# description: arch supports the cmpxchg64() API
> > +#
> > +---
> > +| arch |status|
> > +---
> > +| alpha: | ok |
> > +| arc: | .. |
> > +| arm: | ok |
> > +| arm64: | ok |
> > +| c6x: | .. |
> > +| h8300: | .. |
> > +| hexagon: | .. |
> > +| ia64: | ok |
> > +| m68k: | ok |
> > +| microblaze: | .. |
> > +| mips: | ok |
> > +| nds32: | .. |
> > +| nios2: | .. |
> > +| openrisc: | .. |
> > +| parisc: | ok |
> > +| powerpc: | ok |
> > +| riscv: | ok |
> > +| s390: | ok |
> > +| sh: | .. |
> > +| sparc: | ok |
> > +| um: | .. |
> > +| unicore32: | .. |
> > +| x86: | ok |
> > +| xtensa: | ok |
> > +---
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index 8e0d665c8d53..9840b2577af1 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -358,6 +358,10 @@ config HAVE_ALIGNED_STRUCT_PAGE
> >  	  on a struct page for better performance. However selecting this
> >  	  might increase the size of a struct page by a word.
> >
> > +config ARCH_HAVE_CMPXCHG64
> > +	bool
> > +	default y if 64BIT
> > +
> >  config HAVE_CMPXCHG_LOCAL
> >  	bool
> >
> > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> > index a7f8e7f4b88f..02c75697176e 100644
> > --- a/arch/arm/Kconfig
> > +++ b/arch/arm/Kconfig
> > @@ -13,6 +13,7 @@ config ARM
> >  	select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
> >  	select ARCH_HAS_STRICT_MODULE_RWX if MMU
> >  	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
> > +	select ARCH_HAVE_CMPXCHG64 if !THUMB2_KERNEL
> >  	select ARCH_HAVE_CUSTOM_GPIO_H
> >  	select ARCH_HAS_GCOV_PROFILE_ALL
> >  	select ARCH_MIGHT_HAVE_PC_PARPORT
> > diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
> > index bbe12a038d21..31c49e1482e2 100644
> > --- a/arch/ia64/Kconfig
> > +++ b/arch/ia64/Kconfig
> > @@ -41,6 +41,7 @@ config IA64
> >  	select GENERIC_PENDING_IRQ if SMP
> >  	select GENERIC_IRQ_SHOW
> >  	select GENERIC_IRQ_LEGACY
> > +	select ARCH_HAVE_CMPXCHG64
> >  	select ARCH_HAVE_NMI_SAFE_CMPXCHG
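To make the new symbol's intended use concrete, a hedged sketch of the kind of guarded call site the follow-up block-layer patch would introduce; the function and field names here are illustrative, not the actual patch 2/2:

	#include <linux/atomic.h>

	#ifdef CONFIG_ARCH_HAVE_CMPXCHG64
	/* lockless 64-bit publish; slot and stamp are illustrative names */
	static bool demo_record(u64 *slot, u64 stamp)
	{
		u64 old = READ_ONCE(*slot);

		/* cmpxchg64() returns the previous value; equal means we won */
		return cmpxchg64(slot, old, stamp) == old;
	}
	#else
	/* without cmpxchg64() the caller must fall back to a lock */
	#endif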
Re: [PATCH 3/6] nvme: Move all IO out of controller reset
On Mon, May 21, 2018 at 09:03:31AM -0600, Keith Busch wrote: > On Mon, May 21, 2018 at 10:58:51PM +0800, Ming Lei wrote: > > On Mon, May 21, 2018 at 08:22:19AM -0600, Keith Busch wrote: > > > On Sat, May 19, 2018 at 07:03:58AM +0800, Ming Lei wrote: > > > > On Fri, May 18, 2018 at 10:38:20AM -0600, Keith Busch wrote: > > > > > + > > > > > + if (unfreeze) > > > > > + nvme_wait_freeze(&dev->ctrl); > > > > > + > > > > > > > > a timeout may come just before blk_mq_update_nr_hw_queues() or > > > > the above nvme_wait_freeze(), then both may hang forever. > > > > > > Why would it hang forever? The scan_work doesn't stop a timeout from > > > triggering a reset to reclaim requests necessary to complete a freeze. > > > > nvme_dev_disable() will quiesce queues, then nvme_wait_freeze() or > > blk_mq_update_nr_hw_queues() may hang forever. > > nvme_dev_disable is just the first part of the timeout sequence. You > have to follow it through to the reset_work that either restarts or > kills the queues. nvme_dev_disable() quiesces queues first before killing queues. If queues are quiesced during or before nvme_wait_freeze() is run from the 2nd part of reset, the 2nd part can't move on, and IO hang is caused. Finally no reset can be scheduled at all. Thanks, Ming
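The interaction Ming is pointing at can be reduced to the generic blk-mq primitives. A sketch of the hang using the stock API calls, not the actual nvme call chain:

	#include <linux/blk-mq.h>

	static void demo_freeze_vs_quiesce(struct request_queue *q)
	{
		blk_freeze_queue_start(q);	/* new submitters block at queue entry */
		blk_mq_quiesce_queue(q);	/* dispatch of queued requests stops  */

		/*
		 * Freezing finishes only when every request that entered the
		 * queue completes. A quiesced queue dispatches nothing, so
		 * already-queued requests never complete and this wait never
		 * returns unless something unquiesces the queue or forcefully
		 * ends the requests.
		 */
		blk_mq_freeze_queue_wait(q);

		blk_mq_unquiesce_queue(q);
		blk_mq_unfreeze_queue(q);
	}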
Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
On Mon, May 21, 2018 at 08:16:59AM -0700, Linus Torvalds wrote: > On Mon, May 21, 2018 at 6:51 AM Roman Penyaev <roman.peny...@profitbricks.com> wrote: > > > No, I continue from the pointer, which I assigned on the previous IO > > in order to send IO fairly and keep load balanced. > > Right. And that's exactly what has both me and Paul nervous. You're no > longer in the RCU domain. You're using a pointer where the lifetime has > nothing to do with RCU any more. > > Can it be done? Sure. But you need *other* locking for it (that you haven't > explained), and it's fragile as hell. He looks to actually have it right, but I would want to see a big comment on the read side noting the leak of the pointer and documenting why it is OK. Thanx, Paul > It's probably best to not use RCU for it at all, but depend on that "other > locking" that you have to have anyway, to keep the pointer valid over the > non-RCU region. > > Linus >
Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
On Mon, May 21, 2018 at 03:50:10PM +0200, Roman Penyaev wrote:
> On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney wrote:
> > On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
> >> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney wrote:
> >> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
> >> >> Function is going to be used in transport over RDMA module
> >> >> in subsequent patches.
> >> >>
> >> >> Function returns next element in round-robin fashion,
> >> >> i.e. head will be skipped. NULL will be returned if list
> >> >> is observed as empty.
> >> >>
> >> >> Signed-off-by: Roman Pen
> >> >> Cc: Paul E. McKenney
> >> >> Cc: linux-ker...@vger.kernel.org
> >> >> ---
> >> >>  include/linux/rculist.h | 19 +++
> >> >>  1 file changed, 19 insertions(+)
> >> >>
> >> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> >> >> index 127f534fec94..b0840d5ab25a 100644
> >> >> --- a/include/linux/rculist.h
> >> >> +++ b/include/linux/rculist.h
> >> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
> >> >>  })
> >> >>
> >> >>  /**
> >> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
> >> >> + * @head: the head for the list.
> >> >> + * @ptr: the list head to take the next element from.
> >> >> + * @type: the type of the struct this is embedded in.
> >> >> + * @memb: the name of the list_head within the struct.
> >> >> + *
> >> >> + * Next element returned in round-robin fashion, i.e. head will be skipped,
> >> >> + * but if list is observed as empty, NULL will be returned.
> >> >> + *
> >> >> + * This primitive may safely run concurrently with the _rcu list-mutation
> >> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
> >> >
> >> > Of course, all the set of list_next_or_null_rr_rcu() invocations that
> >> > are round-robining a given list must all be under the same RCU read-side
> >> > critical section. For example, the following will break badly:
> >> >
> >> >	struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
> >> >	{
> >> >		struct foo *ret;
> >> >
> >> >		rcu_read_lock();
> >> >		ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
> >> >		rcu_read_unlock();  /* BUG */
> >> >		return ret;
> >> >	}
> >> >
> >> > You need a big fat comment stating this, at the very least. The resulting
> >> > bug can be very hard to trigger and even harder to debug.
> >> >
> >> > And yes, I know that the same restriction applies to list_next_rcu()
> >> > and friends. The difference is that if you try to invoke those in an
> >> > infinite loop, you will be rapped on the knuckles as soon as you hit
> >> > the list header. Without that knuckle-rapping, RCU CPU stall warnings
> >> > might tempt people to do something broken like take_rr_step() above.
> >>
> >> Hi Paul,
> >>
> >> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
> >> connection from the list in order to send a request, i.e. my code is
> >> something like the following:
> >>
> >>	static struct conn *get_and_set_next_conn(void)
> >>	{
> >>		struct conn *conn;
> >>
> >>		conn = rcu_dereference(rcu_conn);
> >>		if (unlikely(!conn))
> >>			return conn;
> >
> > Wait. Don't you need to restart from the beginning of the list in
> > this case? Or does the list never have anything added to it and is
> > rcu_conn initially the first element in the list?
>
> Hi Paul,
>
> No, I continue from the pointer, which I assigned on the previous IO
> in order to send IO fairly and keep load balanced.
>
> Initially @rcu_conn points to the first element, but elements can
> be deleted from the list and list can become empty.
>
> The deletion code is below.
>
> >
> >>		conn = list_next_or_null_rr_rcu(&conn_list,
> >>						&conn->entry,
> >>						typeof(*conn),
> >>						entry);
> >>		rcu_assign_pointer(rcu_conn, conn);
> >
> > Linus is correct to doubt this code. You assign a pointer to the current
> > element to rcu_conn, which is presumably a per-CPU or global variable.
> > So far, so good ...
>
> I use per-CPU, in the first example I did not show that, not to overcomplicate
> the code.
>
> >
> >>		return conn;
> >>	}
> >>
> >>	rcu_read_lock();
> >>	conn = get_and_set_next_conn();
> >>	if (unlikely(!conn)) {
> >>		/* ... */
> >>	}
> >>	err = rdma_io(conn,
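The macro body itself is clipped from the quoted diff; one plausible implementation on top of the existing list_next_or_null_rcu() looks like the following. This is an approximation for readers, not necessarily the exact patch text:

	#include <linux/rculist.h>

	/*
	 * Approximation of the helper under discussion: return the element
	 * after @ptr, skipping @head; NULL only if the list is observed empty.
	 */
	#define list_next_or_null_rr_rcu(head, ptr, type, memb)			\
	({									\
		/* next element, or NULL when @ptr is the last one ... */	\
		list_next_or_null_rcu(head, ptr, type, memb) ?:			\
			/* ... in which case wrap past @head to the first */	\
			list_next_or_null_rcu(head, READ_ONCE((ptr)->next),	\
					      type, memb);			\
	})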
Re: [PATCH 1/6] nvme: Sync request queues on reset
On Mon, May 21, 2018 at 08:04:13AM -0600, Keith Busch wrote:
> On Sat, May 19, 2018 at 08:01:42AM +0800, Ming Lei wrote:
> > > You keep saying that, but the controller state is global to the
> > > controller. It doesn't matter which namespace request_queue started the
> > > reset: every namespace's request queue sees the RESETTING controller state
> >
> > When timeouts come, the global state of RESETTING may not be updated
> > yet, so all the timeouts may not observe the state.
>
> Even prior to the RESETTING state, every single command, no matter
> which namespace or request_queue it came on, is reclaimed by the driver.
> There *should* be no requests to timeout after nvme_dev_disable is called
> because the nvme driver returned control of all requests in the tagset
> to blk-mq.

The timed-out requests won't be canceled by nvme_dev_disable().

If a timed-out request is handled as RESET_TIMER, there may
be a new timeout event triggered again.

> In any case, if blk-mq decides it won't complete those requests, we
> can just swap the order in the reset_work: sync first, unconditionally
> disable. Does the following snippet look more okay?
>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 17a0190bd88f..42af077ee07a 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2307,11 +2307,14 @@ static void nvme_reset_work(struct work_struct *work)
>  		goto out;
>
>  	/*
> -	 * If we're called to reset a live controller first shut it down before
> -	 * moving on.
> +	 * Ensure there are no timeout work in progress prior to forcefully
> +	 * disabling the queue. There is no harm in disabling the device even
> +	 * when it was already disabled, as this will forcefully reclaim any
> +	 * IOs that are stuck due to blk-mq's timeout handling that prevents
> +	 * timed out requests from completing.
>  	 */
> -	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
> -		nvme_dev_disable(dev, false);
> +	nvme_sync_queues(&dev->ctrl);
> +	nvme_dev_disable(dev, false);

That may not work reliably either.

For example, request A from NS_0 is timed-out and handled as RESET_TIMER, meantime request B from NS_1 is timed-out and handled as EH_HANDLED.

When the above reset work is run for handling the timeout of req B, a new timeout event on request A may come just between the above nvme_sync_queues() and nvme_dev_disable(), then nvme_dev_disable() can't cover request A, and finally the timed-out event for req A will nvme_dev_disable() when the current reset is just in-progress, then this reset can't move on, and IO hang is caused.

Thanks,
Ming
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On 5/21/18 9:12 AM, David Sterba wrote: > On Mon, May 21, 2018 at 08:19:58AM -0600, Jens Axboe wrote: >> On 5/21/18 8:03 AM, Mike Snitzer wrote: >>> On Sun, May 20 2018 at 6:25pm -0400, >>> Kent Overstreet wrote: >>> Jens - this series does the rest of the conversions that Christoph wanted, and drops bioset_create(). Only lightly tested, but the changes are pretty mechanical. Based on your for-next tree. >>> >>> By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' >>> you've altered the alignment of members in data structures. So I'll >>> need to audit all the data structures you've modified in DM. >>> >>> Could we get the backstory on _why_ you're making this change? >>> Would go a long way to helping me appreciate why this is a good use of >>> anyone's time. >> >> Yeah, it's in the first series, it gets rid of a pointer indirection. > > This should also be mentioned in the changelog of each patch. There are > 12 subsystems changed, this could be about 10 maintainers, and I guess > everybody has the same question why the change is made. Agree, the justification should be in this series as well, of course. Kent, might not be a bad idea to resend with a more descriptive cover letter. -- Jens Axboe
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21 2018 at 11:09am -0400, Jens Axboe wrote: > On 5/21/18 9:04 AM, Mike Snitzer wrote: > > On Mon, May 21 2018 at 10:52am -0400, > > Jens Axboe wrote: > > > >> On 5/21/18 8:47 AM, Mike Snitzer wrote: > >>> On Mon, May 21 2018 at 10:36am -0400, > >>> Jens Axboe wrote: > >>> > On 5/21/18 8:31 AM, Mike Snitzer wrote: > > On Mon, May 21 2018 at 10:19am -0400, > > Jens Axboe wrote: > > > >> On 5/21/18 8:03 AM, Mike Snitzer wrote: > >>> On Sun, May 20 2018 at 6:25pm -0400, > >>> Kent Overstreet wrote: > >>> > Jens - this series does the rest of the conversions that Christoph > wanted, and > drops bioset_create(). > > Only lightly tested, but the changes are pretty mechanical. Based on > your > for-next tree. > >>> > >>> By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' > >>> you've altered the alignment of members in data structures. So I'll > >>> need to audit all the data structures you've modified in DM. > >>> > >>> Could we get the backstory on _why_ you're making this change? > >>> Would go a long way to helping me appreciate why this is a good use of > >>> anyone's time. > >> > >> Yeah, it's in the first series, it gets rid of a pointer indirection. > > > > "Allows mempools to be embedded in other structs, getting rid of a > > pointer indirection from allocation fastpaths." > > > > So this is about using contiguous memory or avoiding partial allocation > > failure? Or both? > > > > Or more to it? Just trying to fully appreciate the theory behind the > > perceived associated benefit. > > It's about avoiding a pointer indirection. Instead of having to > follow a pointer to get to that struct, it's simple offset math off > your main structure. > > > I do think the increased risk of these embedded bio_set and mempool_t > > themselves crossing cachelines, or struct members that follow them doing > > so, really detracts from these types of changes. > > Definitely something to look out for, though most of them should be > per-dev structures and not in-flight structures. That makes it a bit > less sensitive. But can't hurt to audit the layouts and adjust if > necessary. This is why it's posted for review :-) > >>> > >>> This isn't something that is easily caught upfront. Yes we can all be > >>> busy little beavers with pahole to audit alignment. But chances are > >>> most people won't do it. > >>> > >>> Reality is there is potential for a regression due to false sharing to > >>> creep in if a hot struct member suddenly starts straddling a cacheline. > >>> That type of NUMA performance killer is pretty insidious and somewhat > >>> tedious to hunt down even when looking for it with specialized tools: > >>> https://joemario.github.io/blog/2016/09/01/c2c-blog/ > >> > >> IMHO you're making a big deal out of something that should not be. > > > > I raised an issue that had seemingly not been considered at all. Not > > making a big deal. Raising it for others' benefit. > > > >> If the dm bits are that sensitive and cache line honed to perfection > >> already due to previous regressions in that area, then it might > >> not be a bad idea to have some compile checks for false cacheline > >> sharing between sensitive members, or spilling of a sub-struct > >> into multiple cachelines. > >> > >> It's not like this was pushed behind your back. It's posted for > >> review. It's quite possible the net change is a win for dm. Let's > >> focus on getting it reviewed, rather than pontificate on what > >> could potentially go all wrong with this. > > > > Why are you making this personal?
Or purely about DM? I'm merely > > pointing out this change isn't something that can be given a quick > > blanket "looks good". > > I'm not making this personal at all?! You raised a (valid) concern, > I'm merely stating why I don't think it's a high risk issue. I'm > assuming your worry is related to dm, as those are the reports > that would ultimately land on your desk. Then we'll just agree to disagree with what this implies: "It's not like this was pushed behind your back." Reality is I'm fine with the change. Just think there is follow-on work (now or later) that is needed. Enough said.
Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
On Mon, May 21, 2018 at 6:51 AM Roman Penyaev <roman.peny...@profitbricks.com> wrote: > No, I continue from the pointer, which I assigned on the previous IO > in order to send IO fairly and keep load balanced. Right. And that's exactly what has both me and Paul nervous. You're no longer in the RCU domain. You're using a pointer where the lifetime has nothing to do with RCU any more. Can it be done? Sure. But you need *other* locking for it (that you haven't explained), and it's fragile as hell. It's probably best to not use RCU for it at all, but depend on that "other locking" that you have to have anyway, to keep the pointer valid over the non-RCU region. Linus
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21, 2018 at 08:19:58AM -0600, Jens Axboe wrote: > On 5/21/18 8:03 AM, Mike Snitzer wrote: > > On Sun, May 20 2018 at 6:25pm -0400, > > Kent Overstreet wrote: > > > >> Jens - this series does the rest of the conversions that Christoph wanted, > >> and > >> drops bioset_create(). > >> > >> Only lightly tested, but the changes are pretty mechanical. Based on your > >> for-next tree. > > > > By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' > > you've altered the alignment of members in data structures. So I'll > > need to audit all the data structures you've modified in DM. > > > > Could we get the backstory on _why_ you're making this change? > > Would go a long way to helping me appreciate why this is a good use of > > anyone's time. > > Yeah, it's in the first series, it gets rid of a pointer indirection. This should also be mentioned in the changelog of each patch. There are 12 subsystems changed, this could be about 10 maintainers, and I guess everybody has the same question why the change is made. The conversion is not exactly the same in all patches; the simple pointer -> static variable change could possibly be covered by the same generic text, but as Mike points out, the alignment changes should at least be mentioned for consideration otherwise.
Re: [PATCH 00/10] Misc block layer patches for bcachefs
On Sun, 2018-05-20 at 19:58 -0400, Kent Overstreet wrote: > On Sun, May 20, 2018 at 11:40:45PM +0000, Bart Van Assche wrote: > > On Sun, 2018-05-20 at 19:21 -0400, Kent Overstreet wrote: > > > I really have better things to do than debug someone else's tests... > > > [ ... ] > > > ../run_tests: line 65: cd: /lib/modules/4.16.0+/kernel/block: No such > > > file or directory > > > > Kernel v4.16 is too old to run these tests. The srp-test script needs the > > following commit that went upstream in kernel v4.17-rc1: > > > > 63cf1a902c9d ("IB/srpt: Add RDMA/CM support") > > Same output on Jens' for-next branch. Others have been able to run the srp-test software with the instructions provided earlier in this e-mail thread. Can you share the kernel messages from around the time the test was run (dmesg, /var/log/messages or /var/log/syslog)? Thanks, Bart.
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On 5/21/18 9:04 AM, Mike Snitzer wrote: > On Mon, May 21 2018 at 10:52am -0400, > Jens Axboe wrote: > >> On 5/21/18 8:47 AM, Mike Snitzer wrote: >>> On Mon, May 21 2018 at 10:36am -0400, >>> Jens Axboe wrote: >>> On 5/21/18 8:31 AM, Mike Snitzer wrote: > On Mon, May 21 2018 at 10:19am -0400, > Jens Axboe wrote: > >> On 5/21/18 8:03 AM, Mike Snitzer wrote: >>> On Sun, May 20 2018 at 6:25pm -0400, >>> Kent Overstreet wrote: >>> Jens - this series does the rest of the conversions that Christoph wanted, and drops bioset_create(). Only lightly tested, but the changes are pretty mechanical. Based on your for-next tree. >>> >>> By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' >>> you've altered the alignment of members in data structures. So I'll >>> need to audit all the data structures you've modified in DM. >>> >>> Could we get the backstory on _why_ you're making this change? >>> Would go a long way to helping me appreciate why this is a good use of >>> anyone's time. >> >> Yeah, it's in the first series, it gets rid of a pointer indirection. > > "Allows mempools to be embedded in other structs, getting rid of a > pointer indirection from allocation fastpaths." > > So this is about using contiguous memory or avoiding partial allocation > failure? Or both? > > Or more to it? Just trying to fully appreciate the theory behind the > perceived associated benefit. It's about avoiding a pointer indirection. Instead of having to follow a pointer to get to that struct, it's simple offset math off your main structure. > I do think the increased risk of these embedded bio_set and mempool_t > themselves crossing cachelines, or struct members that follow them doing > so, really detracts from these types of changes. Definitely something to look out for, though most of them should be per-dev structures and not in-flight structures. That makes it a bit less sensitive. But can't hurt to audit the layouts and adjust if necessary. This is why it's posted for review :-) >>> >>> This isn't something that is easily caught upfront. Yes we can all be >>> busy little beavers with pahole to audit alignment. But chances are >>> most people won't do it. >>> >>> Reality is there is potential for a regression due to false sharing to >>> creep in if a hot struct member suddenly starts straddling a cacheline. >>> That type of NUMA performance killer is pretty insidious and somewhat >>> tedious to hunt down even when looking for it with specialized tools: >>> https://joemario.github.io/blog/2016/09/01/c2c-blog/ >> >> IMHO you're making a big deal out of something that should not be. > > I raised an issue that had seemingly not been considered at all. Not > making a big deal. Raising it for others' benefit. > >> If the dm bits are that sensitive and cache line honed to perfection >> already due to previous regressions in that area, then it might >> not be a bad idea to have some compile checks for false cacheline >> sharing between sensitive members, or spilling of a sub-struct >> into multiple cachelines. >> >> It's not like this was pushed behind your back. It's posted for >> review. It's quite possible the net change is a win for dm. Let's >> focus on getting it reviewed, rather than pontificate on what >> could potentially go all wrong with this. > > Why are you making this personal? Or purely about DM? I'm merely > pointing out this change isn't something that can be given a quick > blanket "looks good". I'm not making this personal at all?! You raised a (valid) concern, I'm merely stating why I don't think it's a high risk issue. I'm assuming your worry is related to dm, as those are the reports that would ultimately land on your desk. -- Jens Axboe
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21 2018 at 10:52am -0400, Jens Axboe wrote: > On 5/21/18 8:47 AM, Mike Snitzer wrote: > > On Mon, May 21 2018 at 10:36am -0400, > > Jens Axboe wrote: > > > >> On 5/21/18 8:31 AM, Mike Snitzer wrote: > >>> On Mon, May 21 2018 at 10:19am -0400, > >>> Jens Axboe wrote: > >>> > On 5/21/18 8:03 AM, Mike Snitzer wrote: > > On Sun, May 20 2018 at 6:25pm -0400, > > Kent Overstreet wrote: > > > >> Jens - this series does the rest of the conversions that Christoph > >> wanted, and > >> drops bioset_create(). > >> > >> Only lightly tested, but the changes are pretty mechanical. Based on > >> your > >> for-next tree. > > > > By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' > > you've altered the alignment of members in data structures. So I'll > > need to audit all the data structures you've modified in DM. > > > > Could we get the backstory on _why_ you're making this change? > > Would go a long way to helping me appreciate why this is a good use of > > anyone's time. > > Yeah, it's in the first series, it gets rid of a pointer indirection. > >>> > >>> "Allows mempools to be embedded in other structs, getting rid of a > >>> pointer indirection from allocation fastpaths." > >>> > >>> So this is about using contiguous memory or avoiding partial allocation > >>> failure? Or both? > >>> > >>> Or more to it? Just trying to fully appreciate the theory behind the > >>> perceived associated benefit. > >> > >> It's about avoiding a pointer indirection. Instead of having to > >> follow a pointer to get to that struct, it's simple offset math off > >> your main structure. > >> > >>> I do think the increased risk of these embedded bio_set and mempool_t > >>> themselves crossing cachelines, or struct members that follow them doing > >>> so, really detracts from these types of changes. > >> > >> Definitely something to look out for, though most of them should be > >> per-dev structures and not in-flight structures. That makes it a bit > >> less sensitive. But can't hurt to audit the layouts and adjust if > >> necessary. This is why it's posted for review :-) > > > > This isn't something that is easily caught upfront. Yes we can all be > > busy little beavers with pahole to audit alignment. But chances are > > most people won't do it. > > > > Reality is there is potential for a regression due to false sharing to > > creep in if a hot struct member suddenly starts straddling a cacheline. > > That type of NUMA performance killer is pretty insidious and somewhat > > tedious to hunt down even when looking for it with specialized tools: > > https://joemario.github.io/blog/2016/09/01/c2c-blog/ > > IMHO you're making a big deal out of something that should not be. I raised an issue that had seemingly not been considered at all. Not making a big deal. Raising it for others' benefit. > If the dm bits are that sensitive and cache line honed to perfection > already due to previous regressions in that area, then it might > not be a bad idea to have some compile checks for false cacheline > sharing between sensitive members, or spilling of a sub-struct > into multiple cachelines. > > It's not like this was pushed behind your back. It's posted for > review. It's quite possible the net change is a win for dm. Let's > focus on getting it reviewed, rather than pontificate on what > could potentially go all wrong with this. Why are you making this personal? Or purely about DM? I'm merely pointing out this change isn't something that can be given a quick blanket "looks good". Mike
Re: [PATCH 3/6] nvme: Move all IO out of controller reset
On Mon, May 21, 2018 at 10:58:51PM +0800, Ming Lei wrote: > On Mon, May 21, 2018 at 08:22:19AM -0600, Keith Busch wrote: > > On Sat, May 19, 2018 at 07:03:58AM +0800, Ming Lei wrote: > > > On Fri, May 18, 2018 at 10:38:20AM -0600, Keith Busch wrote: > > > > + > > > > + if (unfreeze) > > > > + nvme_wait_freeze(&dev->ctrl); > > > > + > > > > > > a timeout may come just before blk_mq_update_nr_hw_queues() or > > > the above nvme_wait_freeze(), then both may hang forever. > > > > Why would it hang forever? The scan_work doesn't stop a timeout from > > triggering a reset to reclaim requests necessary to complete a freeze. > > nvme_dev_disable() will quiesce queues, then nvme_wait_freeze() or > blk_mq_update_nr_hw_queues() may hang forever. nvme_dev_disable is just the first part of the timeout sequence. You have to follow it through to the reset_work that either restarts or kills the queues.
Re: [PATCH 3/6] nvme: Move all IO out of controller reset
On Mon, May 21, 2018 at 08:22:19AM -0600, Keith Busch wrote: > On Sat, May 19, 2018 at 07:03:58AM +0800, Ming Lei wrote: > > On Fri, May 18, 2018 at 10:38:20AM -0600, Keith Busch wrote: > > > + > > > + if (unfreeze) > > > + nvme_wait_freeze(&dev->ctrl); > > > + > > > > a timeout may come just before blk_mq_update_nr_hw_queues() or > > the above nvme_wait_freeze(), then both may hang forever. > > Why would it hang forever? The scan_work doesn't stop a timeout from > triggering a reset to reclaim requests necessary to complete a freeze. nvme_dev_disable() will quiesce queues, then nvme_wait_freeze() or blk_mq_update_nr_hw_queues() may hang forever. Thanks, Ming
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On 5/21/18 8:47 AM, Mike Snitzer wrote: > On Mon, May 21 2018 at 10:36am -0400, > Jens Axboe wrote: > >> On 5/21/18 8:31 AM, Mike Snitzer wrote: >>> On Mon, May 21 2018 at 10:19am -0400, >>> Jens Axboe wrote: >>> On 5/21/18 8:03 AM, Mike Snitzer wrote: > On Sun, May 20 2018 at 6:25pm -0400, > Kent Overstreet wrote: > >> Jens - this series does the rest of the conversions that Christoph >> wanted, and >> drops bioset_create(). >> >> Only lightly tested, but the changes are pretty mechanical. Based on your >> for-next tree. > > By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' > you've altered the alignment of members in data structures. So I'll > need to audit all the data structures you've modified in DM. > > Could we get the backstory on _why_ you're making this change? > Would go a long way to helping me appreciate why this is a good use of > anyone's time. Yeah, it's in the first series, it gets rid of a pointer indirection. >>> >>> "Allows mempools to be embedded in other structs, getting rid of a >>> pointer indirection from allocation fastpaths." >>> >>> So this is about using contiguous memory or avoiding partial allocation >>> failure? Or both? >>> >>> Or more to it? Just trying to fully appreciate the theory behind the >>> perceived associated benefit. >> >> It's about avoiding a pointer indirection. Instead of having to >> follow a pointer to get to that struct, it's simple offset math off >> your main structure. >> >>> I do think the increased risk of these embedded bio_set and mempool_t >>> themselves crossing cachelines, or struct members that follow them doing >>> so, really detracts from these types of changes. >> >> Definitely something to look out for, though most of them should be >> per-dev structures and not in-flight structures. That makes it a bit >> less sensitive. But can't hurt to audit the layouts and adjust if >> necessary. This is why it's posted for review :-) > > This isn't something that is easily caught upfront. Yes we can all be > busy little beavers with pahole to audit alignment. But chances are > most people won't do it. > > Reality is there is potential for a regression due to false sharing to > creep in if a hot struct member suddenly starts straddling a cacheline. > That type of NUMA performance killer is pretty insidious and somewhat > tedious to hunt down even when looking for it with specialized tools: > https://joemario.github.io/blog/2016/09/01/c2c-blog/ IMHO you're making a big deal out of something that should not be. If the dm bits are that sensitive and cache line honed to perfection already due to previous regressions in that area, then it might not be a bad idea to have some compile checks for false cacheline sharing between sensitive members, or spilling of a sub-struct into multiple cachelines. It's not like this was pushed behind your back. It's posted for review. It's quite possible the net change is a win for dm. Let's focus on getting it reviewed, rather than pontificate on what could potentially go all wrong with this. -- Jens Axboe
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21 2018 at 10:36am -0400, Jens Axboe wrote: > On 5/21/18 8:31 AM, Mike Snitzer wrote: > > On Mon, May 21 2018 at 10:19am -0400, > > Jens Axboe wrote: > > > >> On 5/21/18 8:03 AM, Mike Snitzer wrote: > >>> On Sun, May 20 2018 at 6:25pm -0400, > >>> Kent Overstreet wrote: > >>> > Jens - this series does the rest of the conversions that Christoph > wanted, and > drops bioset_create(). > > Only lightly tested, but the changes are pretty mechanical. Based on your > for-next tree. > >>> > >>> By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' > >>> you've altered the alignment of members in data structures. So I'll > >>> need to audit all the data structures you've modified in DM. > >>> > >>> Could we get the backstory on _why_ you're making this change? > >>> Would go a long way to helping me appreciate why this is a good use of > >>> anyone's time. > >> > >> Yeah, it's in the first series, it gets rid of a pointer indirection. > > > > "Allows mempools to be embedded in other structs, getting rid of a > > pointer indirection from allocation fastpaths." > > > > So this is about using contiguous memory or avoiding partial allocation > > failure? Or both? > > > > Or more to it? Just trying to fully appreciate the theory behind the > > perceived associated benefit. > > It's about avoiding a pointer indirection. Instead of having to > follow a pointer to get to that struct, it's simple offset math off > your main structure. > > > I do think the increased risk of these embedded bio_set and mempool_t > > themselves crossing cachelines, or struct members that follow them doing > > so, really detracts from these types of changes. > > Definitely something to look out for, though most of them should be > per-dev structures and not in-flight structures. That makes it a bit > less sensitive. But can't hurt to audit the layouts and adjust if > necessary. This is why it's posted for review :-) This isn't something that is easily caught upfront. Yes we can all be busy little beavers with pahole to audit alignment. But chances are most people won't do it. Reality is there is potential for a regression due to false sharing to creep in if a hot struct member suddenly starts straddling a cacheline. That type of NUMA performance killer is pretty insidious and somewhat tedious to hunt down even when looking for it with specialized tools: https://joemario.github.io/blog/2016/09/01/c2c-blog/
Re: [PATCH] fs: clear writeback errors in inode_init_always
On Mon, May 21, 2018 at 07:20:05AM -0400, Jeff Layton wrote: > > As a side note, it looks like __loop_update_dio will discard an error > from vfs_fsync, so certain ioctls against a loop device can cause errors > to be lost. It seems like those ought to get propagated to userland or > to the blockdev's mapping somehow. I'm not sure it's worth it. All of the ioctls that call loop_update_dio are used just to configure the loop device. In practice they are never called once the loop device is actually running. It might be a better choice to just define that errors get reset if there is any attempt to reconfigure the loop device. One of these ioctls changes the offset that a loop device maps onto the backing file. So for example, while offset 0 of the loop device might have corresponded to the beginning of the backing file, after executing the ioctl, offset 0 of the loop device now corresponds to offset 2MB of the backing store. This might happen if we are resetting the loop device to point at the first partition of a disk image, for example. I really don't think that preserving the error status when we are doing that kind of radical configuration makes sense. Or for example, when we change the block size of the loop device: if the underlying backing store is an IBM Mainframe DASD with a 2k block size, the reason why the error was signalled was because there was an attempt to write a 1k block onto a 2k block device. So again, resetting the error status of the loop device is the right thing to do. The final thing that's worth perhaps exploring is whether or not we need all of these knobs and ioctls. In particular, LOOP_CHANGE_FD is used by losetup. It exists because back in prehistory, some distribution installer needed it for some obscure corner case. If I remember correctly, it had to do with switching out the initramfs floppy disk with the root file system floppy disk. We've wasted developer bandwidth trying to deal with syzbot complaints about that ioctl, and it's not clear it was worth the effort, because it's not clear any real code actually _needs_ that ioctl. It might have been easier to comment out the ioctl, and see if anyone screamed. Given that the loop device is maintained mostly on the margins, and it's not clear that lots of complicated error handling during setup is really necessary, this is basically a plea to keep things simple, or at least no more complicated than it has to be. Cheers, - Ted
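The reconfiguration Ted describes goes through the LOOP_SET_STATUS64 path. A hedged userspace sketch of re-pointing an already-bound loop device at a partition 2MB into its image; the device path and offset are illustrative:

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/loop.h>

	static int demo_point_at_partition(void)
	{
		struct loop_info64 info;
		int lfd = open("/dev/loop0", O_RDWR);	/* hypothetical device */

		if (lfd < 0)
			return -1;
		/* read current status so we only change the offset */
		if (ioctl(lfd, LOOP_GET_STATUS64, &info) < 0)
			return -1;

		info.lo_offset = 2 * 1024 * 1024;	/* partition starts 2MB in */

		/* after this, block 0 of loop0 is block 0 of the partition */
		return ioctl(lfd, LOOP_SET_STATUS64, &info);
	}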
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On 5/21/18 8:31 AM, Mike Snitzer wrote: > On Mon, May 21 2018 at 10:19am -0400, > Jens Axboe wrote: > >> On 5/21/18 8:03 AM, Mike Snitzer wrote: >>> On Sun, May 20 2018 at 6:25pm -0400, >>> Kent Overstreet wrote: >>> Jens - this series does the rest of the conversions that Christoph wanted, and drops bioset_create(). Only lightly tested, but the changes are pretty mechanical. Based on your for-next tree. >>> >>> By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' >>> you've altered the alignment of members in data structures. So I'll >>> need to audit all the data structures you've modified in DM. >>> >>> Could we get the backstory on _why_ you're making this change? >>> Would go a long way to helping me appreciate why this is a good use of >>> anyone's time. >> >> Yeah, it's in the first series, it gets rid of a pointer indirection. > > "Allows mempools to be embedded in other structs, getting rid of a > pointer indirection from allocation fastpaths." > > So this is about using contiguous memory or avoiding partial allocation > failure? Or both? > > Or more to it? Just trying to fully appreciate the theory behind the > perceived associated benefit. It's about avoiding a pointer indirection. Instead of having to follow a pointer to get to that struct, it's simple offset math off your main structure. > I do think the increased risk of these embedded bio_set and mempool_t > themselves crossing cachelines, or struct members that follow them doing > so, really detracts from these types of changes. Definitely something to look out for, though most of them should be per-dev structures and not in-flight structures. That makes it a bit less sensitive. But can't hurt to audit the layouts and adjust if necessary. This is why it's posted for review :-) -- Jens Axboe
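To make the indirection argument concrete, a sketch of the before/after shape of the conversion for a driver-private struct; demo_dev is hypothetical, while the init/exit pairing follows the new bioset_init()/bioset_exit() API from the series:

	#include <linux/bio.h>

	/* before: one extra allocation and one pointer chase per access */
	struct demo_dev_old {
		struct bio_set *bs;		/* bs = bioset_create(...) */
	};

	/* after: embedded member, reached by offset math off the parent */
	struct demo_dev {
		struct bio_set bs;
	};

	static int demo_dev_init(struct demo_dev *dev)
	{
		/* bioset_init() sets up an embedded bio_set in place */
		return bioset_init(&dev->bs, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
	}

	static void demo_dev_exit(struct demo_dev *dev)
	{
		bioset_exit(&dev->bs);		/* teardown; no kfree() of the set */
	}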
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21 2018 at 10:19am -0400, Jens Axboe wrote: > On 5/21/18 8:03 AM, Mike Snitzer wrote: > > On Sun, May 20 2018 at 6:25pm -0400, > > Kent Overstreet wrote: > > > >> Jens - this series does the rest of the conversions that Christoph wanted, > >> and > >> drops bioset_create(). > >> > >> Only lightly tested, but the changes are pretty mechanical. Based on your > >> for-next tree. > > > > By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' > > you've altered the alignment of members in data structures. So I'll > > need to audit all the data structures you've modified in DM. > > > > Could we get the backstory on _why_ you're making this change? > > Would go a long way to helping me appreciate why this is a good use of > > anyone's time. > > Yeah, it's in the first series, it gets rid of a pointer indirection. "Allows mempools to be embedded in other structs, getting rid of a pointer indirection from allocation fastpaths." So this is about using contiguous memory or avoiding partial allocation failure? Or both? Or more to it? Just trying to fully appreciate the theory behind the perceived associated benefit. I do think the increased risk of these embedded bio_set and mempool_t themselves crossing cachelines, or struct members that follow them doing so, really detracts from these types of changes. Thanks, Mike
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On 5/20/18 4:25 PM, Kent Overstreet wrote: > Jens - this series does the rest of the conversions that Christoph wanted, and > drops bioset_create(). > > Only lightly tested, but the changes are pretty mechanical. Based on your > for-next tree. Looks good to me. I'll let it simmer for a bit to give folks a chance to comment on it. -- Jens Axboe
Re: [PATCH 3/6] nvme: Move all IO out of controller reset
On Sat, May 19, 2018 at 07:03:58AM +0800, Ming Lei wrote: > On Fri, May 18, 2018 at 10:38:20AM -0600, Keith Busch wrote: > > + > > + if (unfreeze) > > + nvme_wait_freeze(&dev->ctrl); > > + > > a timeout may come just before blk_mq_update_nr_hw_queues() or > the above nvme_wait_freeze(), then both may hang forever. Why would it hang forever? The scan_work doesn't stop a timeout from triggering a reset to reclaim requests necessary to complete a freeze.
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On 5/21/18 8:03 AM, Mike Snitzer wrote: > On Sun, May 20 2018 at 6:25pm -0400, > Kent Overstreet wrote: > >> Jens - this series does the rest of the conversions that Christoph wanted, >> and >> drops bioset_create(). >> >> Only lightly tested, but the changes are pretty mechanical. Based on your >> for-next tree. > > By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' > you've altered the alignment of members in data structures. So I'll > need to audit all the data structures you've modified in DM. > > Could we get the backstory on _why_ you're making this change? > Would go a long way to helping me appreciate why this is a good use of > anyone's time. Yeah, it's in the first series, it gets rid of a pointer indirection. -- Jens Axboe
Re: [PATCH blktests] Fix block/011 to not use sysfs for device disabling
On Mon, May 21, 2018 at 02:37:56AM -0400, Yi Zhang wrote:
> Hi Keith,
> I tried this patch on my R730 Server, but it led to a system hang after
> setpci, could you help check it, thanks.
>
> Console log:
> storageqe-62 login:
> Kernel 4.17.0-rc5 on an x86_64
>
> storageqe-62 login: [ 1058.118258] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
> [ 1058.118261] {1}[Hardware Error]: event severity: fatal
> [ 1058.118262] {1}[Hardware Error]: Error 0, type: fatal
> [ 1058.118265] {1}[Hardware Error]: section_type: PCIe error
> [ 1058.118266] {1}[Hardware Error]: port_type: 0, PCIe end point
> [ 1058.118267] {1}[Hardware Error]: version: 1.16
> [ 1058.118269] {1}[Hardware Error]: command: 0x0400, status: 0x0010
> [ 1058.118270] {1}[Hardware Error]: device_id: :85:00.0
> [ 1058.118271] {1}[Hardware Error]: slot: 0
> [ 1058.118271] {1}[Hardware Error]: secondary_bus: 0x00
> [ 1058.118273] {1}[Hardware Error]: vendor_id: 0x144d, device_id: 0xa821
> [ 1058.118274] {1}[Hardware Error]: class_code: 020801
> [ 1058.118275] Kernel panic - not syncing: Fatal hardware error!
> [ 1058.118301] Kernel Offset: 0x1480 from 0x8100 (relocation range: 0x8000-0xbfff)

Thanks for the notice. The test may be going too far with the config registers it's touching. Let me see if just doing the BME bit as Ming suggested fixes this.
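For reference, "just do the BME bit" amounts to clearing only Bus Master Enable in PCI_COMMAND rather than poking the wider set of config registers. A hedged kernel-side sketch of the same operation (the blktests fix itself would flip this bit via setpci from shell):

	#include <linux/pci.h>

	static void demo_toggle_bme(struct pci_dev *pdev)
	{
		u16 cmd;

		pci_read_config_word(pdev, PCI_COMMAND, &cmd);
		/* clear only Bus Master Enable; MMIO decoding stays intact */
		pci_write_config_word(pdev, PCI_COMMAND, cmd & ~PCI_COMMAND_MASTER);

		/* ... run I/O and let it time out against the disabled device ... */

		pci_write_config_word(pdev, PCI_COMMAND, cmd);	/* restore */
	}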
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Sun, May 20 2018 at 6:25pm -0400, Kent Overstreet wrote: > Jens - this series does the rest of the conversions that Christoph wanted, and > drops bioset_create(). > > Only lightly tested, but the changes are pretty mechanical. Based on your > for-next tree. By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' you've altered the alignment of members in data structures. So I'll need to audit all the data structures you've modified in DM. Could we get the backstory on _why_ you're making this change? Would go a long way to helping me appreciate why this is a good use of anyone's time. Thanks, Mike
Re: [PATCH 1/6] nvme: Sync request queues on reset
On Sat, May 19, 2018 at 08:01:42AM +0800, Ming Lei wrote:
> > You keep saying that, but the controller state is global to the
> > controller. It doesn't matter which namespace request_queue started the
> > reset: every namespace's request queue sees the RESETTING controller state
>
> When timeouts come, the global state of RESETTING may not be updated
> yet, so all the timeouts may not observe the state.

Even prior to the RESETTING state, every single command, no matter which namespace or request_queue it came on, is reclaimed by the driver. There *should* be no requests to timeout after nvme_dev_disable is called because the nvme driver returned control of all requests in the tagset to blk-mq.

In any case, if blk-mq decides it won't complete those requests, we can just swap the order in the reset_work: sync first, unconditionally disable. Does the following snippet look more okay?

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 17a0190bd88f..42af077ee07a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2307,11 +2307,14 @@ static void nvme_reset_work(struct work_struct *work)
 		goto out;
 
 	/*
-	 * If we're called to reset a live controller first shut it down before
-	 * moving on.
+	 * Ensure there are no timeout work in progress prior to forcefully
+	 * disabling the queue. There is no harm in disabling the device even
+	 * when it was already disabled, as this will forcefully reclaim any
+	 * IOs that are stuck due to blk-mq's timeout handling that prevents
+	 * timed out requests from completing.
 	 */
-	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
-		nvme_dev_disable(dev, false);
+	nvme_sync_queues(&dev->ctrl);
+	nvme_dev_disable(dev, false);
--
Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney wrote:
> On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
>> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney wrote:
>> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
>> >> Function is going to be used in transport over RDMA module
>> >> in subsequent patches.
>> >>
>> >> Function returns next element in round-robin fashion,
>> >> i.e. head will be skipped. NULL will be returned if list
>> >> is observed as empty.
>> >>
>> >> Signed-off-by: Roman Pen
>> >> Cc: Paul E. McKenney
>> >> Cc: linux-ker...@vger.kernel.org
>> >> ---
>> >>  include/linux/rculist.h | 19 +++
>> >>  1 file changed, 19 insertions(+)
>> >>
>> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
>> >> index 127f534fec94..b0840d5ab25a 100644
>> >> --- a/include/linux/rculist.h
>> >> +++ b/include/linux/rculist.h
>> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>> >>  })
>> >>
>> >>  /**
>> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
>> >> + * @head: the head for the list.
>> >> + * @ptr: the list head to take the next element from.
>> >> + * @type: the type of the struct this is embedded in.
>> >> + * @memb: the name of the list_head within the struct.
>> >> + *
>> >> + * Next element returned in round-robin fashion, i.e. head will be skipped,
>> >> + * but if list is observed as empty, NULL will be returned.
>> >> + *
>> >> + * This primitive may safely run concurrently with the _rcu list-mutation
>> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
>> >
>> > Of course, all the set of list_next_or_null_rr_rcu() invocations that
>> > are round-robining a given list must all be under the same RCU read-side
>> > critical section. For example, the following will break badly:
>> >
>> >	struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
>> >	{
>> >		struct foo *ret;
>> >
>> >		rcu_read_lock();
>> >		ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
>> >		rcu_read_unlock();  /* BUG */
>> >		return ret;
>> >	}
>> >
>> > You need a big fat comment stating this, at the very least. The resulting
>> > bug can be very hard to trigger and even harder to debug.
>> >
>> > And yes, I know that the same restriction applies to list_next_rcu()
>> > and friends. The difference is that if you try to invoke those in an
>> > infinite loop, you will be rapped on the knuckles as soon as you hit
>> > the list header. Without that knuckle-rapping, RCU CPU stall warnings
>> > might tempt people to do something broken like take_rr_step() above.
>>
>> Hi Paul,
>>
>> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
>> connection from the list in order to send a request, i.e. my code is
>> something like the following:
>>
>>	static struct conn *get_and_set_next_conn(void)
>>	{
>>		struct conn *conn;
>>
>>		conn = rcu_dereference(rcu_conn);
>>		if (unlikely(!conn))
>>			return conn;
>
> Wait. Don't you need to restart from the beginning of the list in
> this case? Or does the list never have anything added to it and is
> rcu_conn initially the first element in the list?

Hi Paul,

No, I continue from the pointer, which I assigned on the previous IO
in order to send IO fairly and keep load balanced.

Initially @rcu_conn points to the first element, but elements can
be deleted from the list and list can become empty.

The deletion code is below.

>
>>		conn = list_next_or_null_rr_rcu(&conn_list,
>>						&conn->entry,
>>						typeof(*conn),
>>						entry);
>>		rcu_assign_pointer(rcu_conn, conn);
>
> Linus is correct to doubt this code. You assign a pointer to the current
> element to rcu_conn, which is presumably a per-CPU or global variable.
> So far, so good ...

I use per-CPU, in the first example I did not show that, not to overcomplicate
the code.

>
>>		return conn;
>>	}
>>
>>	rcu_read_lock();
>>	conn = get_and_set_next_conn();
>>	if (unlikely(!conn)) {
>>		/* ... */
>>	}
>>	err = rdma_io(conn, request);
>>	rcu_read_unlock();

> ... except that some other CPU might well remove the entry referenced by
> rcu_conn at this point. It would have to wait for a grace period (e.g.,
> synchronize_rcu()), but the current CPU has exited its RCU read-side
> critical section, and therefore is not blocking the grace period.
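The referenced deletion code is truncated by the archive. A hedged sketch of the usual shape of such a removal path, highlighting why a pointer cached outside a read-side critical section is the dangerous piece; the struct and lock names are assumptions:

	#include <linux/rculist.h>
	#include <linux/spinlock.h>
	#include <linux/slab.h>

	struct conn {
		struct list_head entry;
		/* ... */
	};

	static void demo_remove_conn(struct conn *conn, spinlock_t *lock)
	{
		spin_lock(lock);
		list_del_rcu(&conn->entry);	/* unlink under the update-side lock */
		spin_unlock(lock);

		/*
		 * Waits only for readers currently inside rcu_read_lock()
		 * sections. A per-CPU rcu_conn cached by a CPU that already
		 * did rcu_read_unlock() is NOT waited for ...
		 */
		synchronize_rcu();
		kfree(conn);	/* ... so that cached pointer now dangles */
	}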
[PATCH] loop: clear wb_err in bd_inode when detaching backing file
From: Jeff Layton

When a loop block device encounters a writeback error, that error will
get propagated to the bd_inode's wb_err field. If we then detach the
backing file from it, attach another and fsync it, we'll get back the
writeback error that we had from the previous backing file.

This is a bit of a grey area as POSIX doesn't cover loop devices, but
it is somewhat counterintuitive.

If we detach a backing file from the loopdev while there are still
unreported errors, take it as a sign that we're no longer interested in
the previous file, and clear out the wb_err in the loop blockdev.

Signed-off-by: Jeff Layton
---
 drivers/block/loop.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5d4e31655d96..55cf554bc914 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1068,6 +1068,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	if (bdev) {
 		bdput(bdev);
 		invalidate_bdev(bdev);
+		bdev->bd_inode->i_mapping->wb_err = 0;
 	}
 	set_capacity(lo->lo_disk, 0);
 	loop_sysfs_exit(lo);
-- 
2.17.0
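The counterintuitive case the changelog describes can be reproduced from
userspace along these lines. This is a hypothetical sketch, not part of
the patch: the image paths are made up, and provoking the initial
writeback failure (e.g. via dm-error or a full filesystem) is left as a
comment:

        /* Hypothetical reproduction sketch for the stale wb_err. */
        #include <fcntl.h>
        #include <linux/loop.h>
        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <unistd.h>

        int main(void)
        {
                int loopfd = open("/dev/loop0", O_RDWR);
                int img1 = open("/tmp/img1", O_RDWR);

                ioctl(loopfd, LOOP_SET_FD, img1);
                /* ... induce a writeback failure on img1 here; the error
                 * propagates into the loopdev's bd_inode wb_err ... */
                ioctl(loopfd, LOOP_CLR_FD, 0);

                int img2 = open("/tmp/img2", O_RDWR);
                ioctl(loopfd, LOOP_SET_FD, img2);

                /* Without the patch this fsync() can fail with the old
                 * error although img2 never saw a failure. */
                if (fsync(loopfd) < 0)
                        perror("fsync");
                return 0;
        }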
Re: [PATCH] fs: clear writeback errors in inode_init_always
On Sun, 2018-05-20 at 15:41 -0400, Theodore Y. Ts'o wrote:
> On Sun, May 20, 2018 at 12:20:09PM -0700, Matthew Wilcox wrote:
> > Oh!  I misunderstood.  I thought that bd_inode->i_mapping pointed to
> > lo_backing_file->f_mapping.
> >
> > So how about this to preserve the error _in the file with the error_?
> >
> >	if (bdev) {
> >		bdput(bdev);
> >		invalidate_bdev(bdev);
> > +		filp->f_mapping->wb_err = bdev->bd_inode->i_mapping->wb_err;
> > +		bdev->bd_inode->i_mapping->wb_err = 0;
> >	}
> >	set_capacity(lo->lo_disk, 0);
> >	loop_sysfs_exit(lo);
>
> I don't think it's necessary.  wb_err on the underlying file would
> have already been set when the error was set.  So if someone tried
> calling fsync(2) on the underlying file, it would have collected the
> error already.  Re-setting the error when we detach the loop device
> doesn't seem to make any sense.
>

(cc'ing linux-block)

Yes, the error would have been set on the original file already. As long
as lo_req_flush or __loop_update_dio weren't called before we detach the
device, it should still be possible to call fsync on the original fd and
get back the error.

As a side note, it looks like __loop_update_dio will discard an error
from vfs_fsync, so certain ioctls against a loop device can cause errors
to be lost. It seems like those ought to get propagated to userland or
to the blockdev's mapping somehow.
-- 
Jeff Layton
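The "would have collected the error already" reasoning rests on the
errseq_t mechanism behind wb_err: an error recorded in a mapping stays
reportable to every observer that has not sampled past it, and each
observer collects it exactly once. A minimal sketch of those semantics
using the helpers from lib/errseq.c (the demo framing is illustrative
only, not code anyone would merge):

        /* Sketch of errseq_t semantics underlying wb_err reporting. */
        #include <linux/bug.h>
        #include <linux/errseq.h>

        static void errseq_demo(void)
        {
                errseq_t wb_err = 0;          /* like mapping->wb_err */
                errseq_t fd1_seen, fd2_seen;  /* like file->f_wb_err  */

                fd1_seen = errseq_sample(&wb_err);  /* first open()  */
                fd2_seen = errseq_sample(&wb_err);  /* second open() */

                errseq_set(&wb_err, -EIO);          /* writeback fails once */

                /* Each opener collects the error exactly once on fsync... */
                WARN_ON(errseq_check_and_advance(&wb_err, &fd1_seen) != -EIO);
                WARN_ON(errseq_check_and_advance(&wb_err, &fd2_seen) != -EIO);

                /* ...and a later fsync on the same fd is clean again. */
                WARN_ON(errseq_check_and_advance(&wb_err, &fd1_seen) != 0);
        }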
Re: [PATCH] bdi: Fix oops in wb_workfn()
On Sat 19-05-18 23:27:09, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Jan Kara wrote:
> > > Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
> > > the necessary precautions against racing with bdi unregistration.
> >
> > Yes, this patch will solve NULL pointer dereference bug. But is it OK to leave
> > list_empty(&wb->work_list) == false situation? Who takes over the role of making
> > list_empty(&wb->work_list) == true?
>
> syzbot is again reporting the same NULL pointer dereference.
>
>   general protection fault in wb_workfn (2)
>   https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

Gaah... So we are still missing something.

> Didn't we overlook something obvious in commit b8b784958eccbf8f ("bdi:
> Fix oops in wb_workfn()") ?
>
> At first, I thought that that commit will solve NULL pointer dereference bug.
> But what does
>
>	if (!list_empty(&wb->work_list))
> -		mod_delayed_work(bdi_wq, &wb->dwork, 0);
> +		wb_wakeup(wb);
>	else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
>		wb_wakeup_delayed(wb);
>
> mean?
>
>	static void wb_wakeup(struct bdi_writeback *wb)
>	{
>		spin_lock_bh(&wb->work_lock);
>		if (test_bit(WB_registered, &wb->state))
>			mod_delayed_work(bdi_wq, &wb->dwork, 0);
>		spin_unlock_bh(&wb->work_lock);
>	}
>
> It means nothing but "we don't call mod_delayed_work() if WB_registered
> bit was already cleared".

Exactly.

> But if WB_registered bit is not yet cleared when we hit
> wb_wakeup_delayed() path?
>
>	void wb_wakeup_delayed(struct bdi_writeback *wb)
>	{
>		unsigned long timeout;
>
>		timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
>		spin_lock_bh(&wb->work_lock);
>		if (test_bit(WB_registered, &wb->state))
>			queue_delayed_work(bdi_wq, &wb->dwork, timeout);
>		spin_unlock_bh(&wb->work_lock);
>	}
>
> add_timer() is called because (presumably) timeout > 0. And after that
> timeout expires, __queue_work() is called even if WB_registered bit is
> already cleared before that timeout expires, isn't it?

Yes.

>	void delayed_work_timer_fn(struct timer_list *t)
>	{
>		struct delayed_work *dwork = from_timer(dwork, t, timer);
>
>		/* should have been called from irqsafe timer with irq already off */
>		__queue_work(dwork->cpu, dwork->wq, &dwork->work);
>	}
>
> Then, wb_workfn() is after all scheduled even if we check for
> WB_registered bit, isn't it?

It can be queued after WB_registered bit is cleared but it cannot be
queued after mod_delayed_work(bdi_wq, &wb->dwork, 0) has finished. That
function deletes the pending timer (the timer cannot be armed again
because WB_registered is cleared) and queues what should be the last
round of wb_workfn().

> Then, don't we need to check that
>
>	mod_delayed_work(bdi_wq, &wb->dwork, 0);
>	flush_delayed_work(&wb->dwork);
>
> is really waiting for completion? At least, shouldn't we try below debug
> output (not only for debugging this report but also generally desirable)?
>
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 7441bd9..ccec8cd 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -376,8 +376,10 @@ static void wb_shutdown(struct bdi_writeback *wb)
>	 * tells wb_workfn() that @wb is dying and its work_list needs to
>	 * be drained no matter what.
>	 */
> -	mod_delayed_work(bdi_wq, &wb->dwork, 0);
> -	flush_delayed_work(&wb->dwork);
> +	if (!mod_delayed_work(bdi_wq, &wb->dwork, 0))
> +		printk(KERN_WARNING "wb_shutdown: mod_delayed_work() failed\n");

false return from mod_delayed_work() just means that there was no timer
armed. That is a valid situation if there are no dirty data.
> +	if (!flush_delayed_work(&wb->dwork))
> +		printk(KERN_WARNING "wb_shutdown: flush_delayed_work() failed\n");

And this is valid as well (although unlikely) if the work managed to
complete on another CPU before flush_delayed_work() was called. So I
don't think your warnings will help us much. But yes, we need to debug
this somehow. For now I have no idea what could be still going wrong.

								Honza
-- 
Jan Kara
SUSE Labs, CR
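To restate in one place the return-value semantics the two of them agree
on (a summary sketch of the shutdown sequence, not a proposed change):

        /* Sketch: why false returns in wb_shutdown() are benign. */
        static void wb_shutdown_sequence(struct bdi_writeback *wb)
        {
                /* WB_registered is already clear at this point, so
                 * wb_wakeup() and wb_wakeup_delayed() can no longer
                 * re-arm the timer. */

                /* false here only means no timer/work was pending,
                 * i.e. nothing was dirty.  Otherwise the pending timer
                 * is cancelled and one final wb_workfn() is queued. */
                mod_delayed_work(bdi_wq, &wb->dwork, 0);

                /* false here means the work already finished on another
                 * CPU.  Either way, once this returns, no wb_workfn()
                 * invocation for this wb can still be running. */
                flush_delayed_work(&wb->dwork);
        }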
Re: [PATCH blktests] Fix block/011 to not use sysfs for device disabling
Hi Keith,

I tried this patch on my R730 server, but it leads to a system hang
after setpci; could you help check it? Thanks.

Console log:
storageqe-62 login: Kernel 4.17.0-rc5 on an x86_64
storageqe-62 login: [ 1058.118258] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
[ 1058.118261] {1}[Hardware Error]: event severity: fatal
[ 1058.118262] {1}[Hardware Error]:  Error 0, type: fatal
[ 1058.118265] {1}[Hardware Error]:   section_type: PCIe error
[ 1058.118266] {1}[Hardware Error]:   port_type: 0, PCIe end point
[ 1058.118267] {1}[Hardware Error]:   version: 1.16
[ 1058.118269] {1}[Hardware Error]:   command: 0x0400, status: 0x0010
[ 1058.118270] {1}[Hardware Error]:   device_id: :85:00.0
[ 1058.118271] {1}[Hardware Error]:   slot: 0
[ 1058.118271] {1}[Hardware Error]:   secondary_bus: 0x00
[ 1058.118273] {1}[Hardware Error]:   vendor_id: 0x144d, device_id: 0xa821
[ 1058.118274] {1}[Hardware Error]:   class_code: 020801
[ 1058.118275] Kernel panic - not syncing: Fatal hardware error!
[ 1058.118301] Kernel Offset: 0x1480 from 0x8100 (relocation range: 0x8000-0xbfff)

Best Regards,
Yi Zhang

----- Original Message -----
From: "Keith Busch"
To: "Omar Sandoval", linux-block@vger.kernel.org, linux-n...@lists.infradead.org
Cc: "Johannes Thumshirn", "Christoph Hellwig", "Jens Axboe", "Ming Lei", "Keith Busch"
Sent: Saturday, May 19, 2018 1:42:47 AM
Subject: [PATCH blktests] Fix block/011 to not use sysfs for device disabling

The PCI sysfs interface may not be a dependable method for toggling the
PCI device state to trigger the timeouts. This patch goes directly to
the config space to make device failure occur.

The success of this test is still sensitive to timing, as it may disable
IO memory when a driver is trying to bring it online. This can look like
a permanent device failure from the driver's perspective.

Signed-off-by: Keith Busch
---
 tests/block/011 | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tests/block/011 b/tests/block/011
index 62e89f7..2fc0ffb 100755
--- a/tests/block/011
+++ b/tests/block/011
@@ -21,7 +21,7 @@ DESCRIPTION="disable PCI device while doing I/O"
 TIMED=1
 
 requires() {
-	_have_fio
+	_have_fio && _have_program setpci
 }
 
 device_requires() {
@@ -43,10 +43,11 @@ test_device() {
 	_run_fio_rand_io --filename="$TEST_DEV" --size="$size" \
 		--ignore_error=EIO,ENXIO,ENODEV &
 
+	# toggle PCI Command Register's Memory and Bus Master enabling
 	while kill -0 $! 2>/dev/null; do
-		echo 0 > "/sys/bus/pci/devices/${pdev}/enable"
+		setpci -s "${pdev}" 4.w=0:6
 		sleep .2
-		echo 1 > "/sys/bus/pci/devices/${pdev}/enable"
+		setpci -s "${pdev}" 4.w=6:6
 		sleep .2
 	done
-- 
2.14.3
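For anyone decoding the setpci lines: offset 4 in PCI config space is
the 16-bit Command Register, `.w` selects a word access, and
`=VALUE:MASK` writes only the bits set in MASK. Mask 6 covers bit 1
(Memory Space Enable) and bit 2 (Bus Master Enable), so `4.w=0:6` cuts
off MMIO and DMA while `4.w=6:6` restores them, leaving all other
Command Register bits untouched. A small illustration (the device
address below is an example, not the reporter's device):

        # Example only: poke the Command Register of a device.
        pdev=0000:85:00.0

        setpci -s "$pdev" COMMAND     # read the current Command Register
        setpci -s "$pdev" 4.w=0:6     # clear Memory Space + Bus Master bits
        setpci -s "$pdev" 4.w=6:6     # set them again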
Re: [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation
Hi Roman,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc6 next-20180517]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Roman-Pen/InfiniBand-Transport-IBTRS-and-Network-Block-Device-IBNBD/20180520-222445
config: m68k-allyesconfig (attached as .config)
compiler: m68k-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k

All errors (new ones prefixed by >>):

   drivers/mtd/nand/raw/nand_base.o: In function `nand_soft_waitrdy':
   nand_base.c:(.text+0x1022): undefined reference to `__udivdi3'
   drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.o: In function `ibtrs_clt_stats_wc_completion_to_str':
>> ibtrs-clt-stats.c:(.text+0x172): undefined reference to `__udivdi3'
   drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.o: In function `ibtrs_clt_stats_sg_list_distr_to_str':
   ibtrs-clt-stats.c:(.text+0x49c): undefined reference to `__udivdi3'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: application/gzip
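`__udivdi3` is the libgcc helper for 64-bit division, which the kernel
deliberately does not link on 32-bit targets such as m68k; any plain `/`
applied to a u64 in the ibtrs stats code will trip this. The
conventional fix is the div64 helpers from <linux/math64.h>. A sketch of
the pattern (the function and field names here are hypothetical,
mirroring the *_to_str helpers flagged above):

        /* Sketch: avoid __udivdi3 by using the kernel's div64 helpers. */
        #include <linux/math64.h>

        static u64 avg_wc_completion_ns(u64 total_ns, u32 nr_samples)
        {
                if (!nr_samples)
                        return 0;
                /*
                 * return total_ns / nr_samples;
                 * would emit a __udivdi3 call on 32-bit architectures.
                 */
                return div_u64(total_ns, nr_samples);  /* u64 / u32 */
        }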