Re: [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation
On Fri, May 18, 2018 at 03:04:01PM +0200, Roman Pen wrote:
> Add IBTRS Makefile, Kconfig and also corresponding lines into upper
> layer infiniband/ulp files.
>
> Signed-off-by: Roman Pen
> Signed-off-by: Danil Kipnis
> Cc: Jack Wang
> ---
>  drivers/infiniband/Kconfig            |  1 +
>  drivers/infiniband/ulp/Makefile       |  1 +
>  drivers/infiniband/ulp/ibtrs/Kconfig  | 20
>  drivers/infiniband/ulp/ibtrs/Makefile | 15 +++
>  4 files changed, 37 insertions(+)
>  create mode 100644 drivers/infiniband/ulp/ibtrs/Kconfig
>  create mode 100644 drivers/infiniband/ulp/ibtrs/Makefile
>
> diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
> index ee270e065ba9..787bd286fb08 100644
> --- a/drivers/infiniband/Kconfig
> +++ b/drivers/infiniband/Kconfig
> @@ -94,6 +94,7 @@ source "drivers/infiniband/ulp/srpt/Kconfig"
>
>  source "drivers/infiniband/ulp/iser/Kconfig"
>  source "drivers/infiniband/ulp/isert/Kconfig"
> +source "drivers/infiniband/ulp/ibtrs/Kconfig"
>
>  source "drivers/infiniband/ulp/opa_vnic/Kconfig"
>  source "drivers/infiniband/sw/rdmavt/Kconfig"
> diff --git a/drivers/infiniband/ulp/Makefile b/drivers/infiniband/ulp/Makefile
> index 437813c7b481..1c4f10dc8d49 100644
> --- a/drivers/infiniband/ulp/Makefile
> +++ b/drivers/infiniband/ulp/Makefile
> @@ -5,3 +5,4 @@ obj-$(CONFIG_INFINIBAND_SRPT) += srpt/
>  obj-$(CONFIG_INFINIBAND_ISER)     += iser/
>  obj-$(CONFIG_INFINIBAND_ISERT)    += isert/
>  obj-$(CONFIG_INFINIBAND_OPA_VNIC) += opa_vnic/
> +obj-$(CONFIG_INFINIBAND_IBTRS)    += ibtrs/
> diff --git a/drivers/infiniband/ulp/ibtrs/Kconfig b/drivers/infiniband/ulp/ibtrs/Kconfig
> new file mode 100644
> index ..eaeb8f3f6b4e
> --- /dev/null
> +++ b/drivers/infiniband/ulp/ibtrs/Kconfig
> @@ -0,0 +1,20 @@
> +config INFINIBAND_IBTRS
> +	tristate
> +	depends on INFINIBAND_ADDR_TRANS
> +
> +config INFINIBAND_IBTRS_CLIENT
> +	tristate "IBTRS client module"
> +	depends on INFINIBAND_ADDR_TRANS
> +	select INFINIBAND_IBTRS
> +	help
> +	  IBTRS client allows for simplified data transfer and connection
> +	  establishment over RDMA (InfiniBand, RoCE, iWarp). Uses BIO-like
> +	  READ/WRITE semantics and provides multipath capabilities.
> +
> +config INFINIBAND_IBTRS_SERVER
> +	tristate "IBTRS server module"
> +	depends on INFINIBAND_ADDR_TRANS
> +	select INFINIBAND_IBTRS
> +	help
> +	  IBTRS server module processing connection and IO requests received
> +	  from the IBTRS client module.
> diff --git a/drivers/infiniband/ulp/ibtrs/Makefile b/drivers/infiniband/ulp/ibtrs/Makefile
> new file mode 100644
> index ..e6ea858745ad
> --- /dev/null
> +++ b/drivers/infiniband/ulp/ibtrs/Makefile
> @@ -0,0 +1,15 @@
> +ibtrs-client-y := ibtrs-clt.o \
> +	ibtrs-clt-stats.o \
> +	ibtrs-clt-sysfs.o
> +
> +ibtrs-server-y := ibtrs-srv.o \
> +	ibtrs-srv-stats.o \
> +	ibtrs-srv-sysfs.o
> +
> +ibtrs-core-y := ibtrs.o
> +
> +obj-$(CONFIG_INFINIBAND_IBTRS)        += ibtrs-core.o

Will it build ibtrs-core in case both server and client are disabled
in .config?

> +obj-$(CONFIG_INFINIBAND_IBTRS_CLIENT) += ibtrs-client.o
> +obj-$(CONFIG_INFINIBAND_IBTRS_SERVER) += ibtrs-server.o
> +
> +-include $(src)/compat/compat.mk

What is this?

> --
> 2.13.1
Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
On May 21, 2018, at 9:47 PM, Ming Lei wrote:
>
>> On Mon, May 21, 2018 at 09:16:33PM -0600, Jens Axboe wrote:
>>> On 5/21/18 8:49 PM, Ming Lei wrote:
>>>> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
>>>>> This patch simplifies the timeout handling by relying on the request
>>>>> reference counting to ensure the iterator is operating on an inflight
>>>>
>>>> The reference counting isn't free, what is the real benefit in this way?
>>>
>>> Neither is the current scheme and locking, and this is a hell of a lot
>>> simpler. I'd get rid of the kref stuff and just do a simple
>>> atomic_dec_and_test(). Most of the time we should be uncontended on
>>> that, which would make it pretty darn cheap. I'd be surprised if it
>>> wasn't better than the current alternatives.
>>
>> The explicit memory barriers implied by atomic_dec_and_test() aren't free.

I'm not saying it's free. Neither is our current synchronization.

>> Also the double completion issue needs to be fixed before discussing
>> this approach further.

Certainly. Also not saying that the current patch is perfect. But it's a
lot more palatable than the alternatives, hence my interest. And I'd like
for this issue to get solved, we seem to be a bit stuck atm.

I'll take a closer look tomorrow.
Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
On Mon, May 21, 2018 at 09:16:33PM -0600, Jens Axboe wrote:
> On 5/21/18 8:49 PM, Ming Lei wrote:
> > On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
> > > This patch simplifies the timeout handling by relying on the request
> > > reference counting to ensure the iterator is operating on an inflight
> >
> > The reference counting isn't free, what is the real benefit in this way?
>
> Neither is the current scheme and locking, and this is a hell of a lot
> simpler. I'd get rid of the kref stuff and just do a simple
> atomic_dec_and_test(). Most of the time we should be uncontended on
> that, which would make it pretty darn cheap. I'd be surprised if it
> wasn't better than the current alternatives.

The explicit memory barriers implied by atomic_dec_and_test() aren't free.

Also the double completion issue needs to be fixed before discussing
this approach further.

Thanks,
Ming
Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
On 5/21/18 8:49 PM, Ming Lei wrote:
> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
>> This patch simplifies the timeout handling by relying on the request
>> reference counting to ensure the iterator is operating on an inflight
>
> The reference counting isn't free, what is the real benefit in this way?

Neither is the current scheme and locking, and this is a hell of a lot
simpler. I'd get rid of the kref stuff and just do a simple
atomic_dec_and_test(). Most of the time we should be uncontended on
that, which would make it pretty darn cheap. I'd be surprised if it
wasn't better than the current alternatives.

--
Jens Axboe
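Sketched out, the atomic_dec_and_test() variant Jens describes might look
something like this — an untested illustration only; the ref_count field
and the helper names are invented here, not taken from the posted series:

	static void blk_mq_put_request_ref(struct request *rq)
	{
		/*
		 * atomic_dec_and_test() implies a full barrier; the final
		 * put is the only place the tag may be released for reuse.
		 */
		if (atomic_dec_and_test(&rq->ref_count))
			__blk_mq_free_request(rq);
	}

	static bool blk_mq_get_request_ref(struct request *rq)
	{
		/*
		 * Timeout path: claim the request only if it is still
		 * alive; a request racing with its final put is skipped.
		 */
		return atomic_inc_not_zero(&rq->ref_count);
	}

The uncontended case is then a single atomic op per request on the
completion side, which is the "pretty darn cheap" part of the argument
above.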
Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
> This patch simplifies the timeout handling by relying on the request
> reference counting to ensure the iterator is operating on an inflight

The reference counting isn't free, what is the real benefit in this way?

> and truly timed out request. Since the reference counting prevents the
> tag from being reallocated, the block layer no longer needs to prevent
> drivers from completing their requests while the timeout handler is
> operating on it: a driver completing a request is allowed to proceed to
> the next state without additional synchronization with the block layer.

This might cause trouble for drivers, since the previous behaviour is
that one request is only completed from one path, and now you change
the behaviour.

>
> This also removes any need for generation sequence numbers since the
> request lifetime is prevented from being reallocated as a new sequence
> while timeout handling is operating on it.
>
> Signed-off-by: Keith Busch
> ---
>  block/blk-core.c       |   6 --
>  block/blk-mq-debugfs.c |   1 -
>  block/blk-mq.c         | 259 ++---
>  block/blk-mq.h         |  20 +---
>  block/blk-timeout.c    |   1 -
>  include/linux/blkdev.h |  26 +
>  6 files changed, 58 insertions(+), 255 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 43370faee935..cee03cad99f2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
>  	rq->internal_tag = -1;
>  	rq->start_time_ns = ktime_get_ns();
>  	rq->part = NULL;
> -	seqcount_init(&rq->gstate_seq);
> -	u64_stats_init(&rq->aborted_gstate_sync);
> -	/*
> -	 * See comment of blk_mq_init_request
> -	 */
> -	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
>  }
>  EXPORT_SYMBOL(blk_rq_init);
>
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 3080e18cb859..ffa622366922 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -344,7 +344,6 @@ static const char *const rqf_name[] = {
>  	RQF_NAME(STATS),
>  	RQF_NAME(SPECIAL_PAYLOAD),
>  	RQF_NAME(ZONE_WRITE_LOCKED),
> -	RQF_NAME(MQ_TIMEOUT_EXPIRED),
>  	RQF_NAME(MQ_POLL_SLEPT),
>  };
>  #undef RQF_NAME
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 66e5c768803f..4858876fd364 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -589,56 +589,6 @@ static void __blk_mq_complete_request(struct request *rq)
>  	put_cpu();
>  }
>
> -static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
> -	__releases(hctx->srcu)
> -{
> -	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
> -		rcu_read_unlock();
> -	else
> -		srcu_read_unlock(hctx->srcu, srcu_idx);
> -}
> -
> -static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
> -	__acquires(hctx->srcu)
> -{
> -	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
> -		/* shut up gcc false positive */
> -		*srcu_idx = 0;
> -		rcu_read_lock();
> -	} else
> -		*srcu_idx = srcu_read_lock(hctx->srcu);
> -}
> -
> -static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
> -{
> -	unsigned long flags;
> -
> -	/*
> -	 * blk_mq_rq_aborted_gstate() is used from the completion path and
> -	 * can thus be called from irq context. u64_stats_fetch in the
> -	 * middle of update on the same CPU leads to lockup. Disable irq
> -	 * while updating.
> -	 */
> -	local_irq_save(flags);
> -	u64_stats_update_begin(&rq->aborted_gstate_sync);
> -	rq->aborted_gstate = gstate;
> -	u64_stats_update_end(&rq->aborted_gstate_sync);
> -	local_irq_restore(flags);
> -}
> -
> -static u64 blk_mq_rq_aborted_gstate(struct request *rq)
> -{
> -	unsigned int start;
> -	u64 aborted_gstate;
> -
> -	do {
> -		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
> -		aborted_gstate = rq->aborted_gstate;
> -	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
> -
> -	return aborted_gstate;
> -}
> -
>  /**
>   * blk_mq_complete_request - end I/O on a request
>   * @rq: the request being processed
> @@ -650,27 +600,10 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
>  void blk_mq_complete_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
> -	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> -	int srcu_idx;
>
>  	if (unlikely(blk_should_fake_timeout(q)))
>  		return;
> -
> -	/*
> -	 * If @rq->aborted_gstate equals the current instance, timeout is
> -	 * claiming @rq and we lost. This is synchronized through
> -	 * hctx_lock(). See blk_mq_timeout_work() for details.
> -	 *
> -	 * Completion path never blocks and we can directly use RCU here
> -	 * instead of
Re: [RFC PATCH 2/3] blk-mq: Fix timeout and state order
On Mon, May 21, 2018 at 05:11:30PM -0600, Keith Busch wrote:
> The block layer had been setting the state to in-flight prior to updating
> the timer. This is the wrong order since the timeout handler could observe
> the in-flight state with the older timeout, believing the request had
> expired when in fact it is just getting started.
>
> Signed-off-by: Keith Busch
> ---
>  block/blk-mq.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8b370ed75605..66e5c768803f 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -713,8 +713,8 @@ void blk_mq_start_request(struct request *rq)
>  	preempt_disable();
>  	write_seqcount_begin(&rq->gstate_seq);
>
> -	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>  	blk_add_timer(rq);
> +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>
>  	write_seqcount_end(&rq->gstate_seq);
>  	preempt_enable();
> --
> 2.14.3
>

Looks fine,

Reviewed-by: Ming Lei

Thanks,
Ming
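To spell the race out (a schematic added for clarity, not taken from the
thread): with the old order, the timeout worker can observe a request that
is in flight but still carries the deadline from its previous incarnation:

	blk_mq_start_request()           timeout worker
	----------------------           --------------
	state = MQ_RQ_IN_FLIGHT
	                                 sees IN_FLIGHT
	                                 reads stale deadline -> "expired"
	blk_add_timer(rq)  /* too late */

Arming the timer first means any request observed as in-flight already
carries a fresh deadline.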
Re: [RFC PATCH 1/3] blk-mq: Reference count request usage
On Mon, May 21, 2018 at 05:11:29PM -0600, Keith Busch wrote:
> This patch adds a struct kref to struct request so that request users
> can be sure they're operating on the same request without it changing
> while they're processing it. The request's tag won't be released for
> reuse until the last user is done with it.
>
> Signed-off-by: Keith Busch
> ---
>  block/blk-mq.c         | 30 +++---
>  include/linux/blkdev.h |  2 ++
>  2 files changed, 25 insertions(+), 7 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4cbfd784e837..8b370ed75605 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -332,6 +332,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>  #endif
>
>  	data->ctx->rq_dispatched[op_is_sync(op)]++;
> +	kref_init(&rq->ref);
>  	return rq;
>  }
>
> @@ -465,13 +466,33 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
>
> +static void blk_mq_exit_request(struct kref *ref)
> +{
> +	struct request *rq = container_of(ref, struct request, ref);
> +	struct request_queue *q = rq->q;
> +	struct blk_mq_ctx *ctx = rq->mq_ctx;
> +	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> +	const int sched_tag = rq->internal_tag;
> +
> +	if (rq->tag != -1)
> +		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
> +	if (sched_tag != -1)
> +		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
> +	blk_mq_sched_restart(hctx);
> +	blk_queue_exit(q);
> +}
> +
> +static void blk_mq_put_request(struct request *rq)
> +{
> +	kref_put(&rq->ref, blk_mq_exit_request);
> +}
> +
>  void blk_mq_free_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
>  	struct elevator_queue *e = q->elevator;
>  	struct blk_mq_ctx *ctx = rq->mq_ctx;
>  	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
> -	const int sched_tag = rq->internal_tag;
>
>  	if (rq->rq_flags & RQF_ELVPRIV) {
>  		if (e && e->type->ops.mq.finish_request)
> @@ -495,12 +516,7 @@ void blk_mq_free_request(struct request *rq)
>  		blk_put_rl(blk_rq_rl(rq));
>
>  	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
> -	if (rq->tag != -1)
> -		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
> -	if (sched_tag != -1)
> -		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
> -	blk_mq_sched_restart(hctx);
> -	blk_queue_exit(q);
> +	blk_mq_put_request(rq);

Both the line above (an atomic_try_cmpxchg_release is implied by kref_put())
and the kref_init() in blk_mq_rq_ctx_init() run in the fast path and may
introduce some cost, so you may have to run some benchmarks to show whether
there is an effect.

Also, given the cost isn't free, could you describe a bit in the commit
log what we get for that cost?

Thanks,
Ming
Re: [PATCH 1/6] nvme: Sync request queues on reset
On Mon, May 21, 2018 at 10:25:17AM -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 12:08:37AM +0800, Ming Lei wrote:
> > Please take a look at blk_mq_complete_request(). Even with Bart's
> > change, the request still won't be completed by driver. The request can
> > only be completed by either driver or blk-mq, not both.
>
> So you're saying blk-mq can't complete a request the driver returned to
> blk-mq to complete. And that's the nvme driver's problem to fix?

To avoid use-after-free, one request can only be completed by one path,
either by the timeout path or by normal completion (irq or cancel) from
the driver.

So before this request's timeout is handled, the request has to have been
marked as completed by the blk-mq timeout code already, and then
nvme_cancel_request() can't cover this timed-out request.

Thanks,
Ming
Re: [PATCH 3/6] nvme: Move all IO out of controller reset
On Mon, May 21, 2018 at 10:23:55AM -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 12:04:53AM +0800, Ming Lei wrote:
> > On Mon, May 21, 2018 at 09:44:33AM -0600, Keith Busch wrote:
> > > On Mon, May 21, 2018 at 11:34:27PM +0800, Ming Lei wrote:
> > > > nvme_dev_disable() quiesces queues first before killing queues.
> > > >
> > > > If queues are quiesced during or before nvme_wait_freeze() is run
> > > > from the 2nd part of reset, the 2nd part can't move on, and IO hang
> > > > is caused. Finally no reset can be scheduled at all.
> > >
> > > But this patch moves nvme_wait_freeze outside the reset path, so I'm
> > > afraid I'm unable to follow how you've concluded the wait freeze is
> > > somehow part of the reset.
> >
> > For example:
> >
> > 1) the 1st timeout event:
> >
> > - nvme_dev_disable()
> > - reset
> > - scan_work
> >
> > 2) the 2nd timeout event:
> >
> > nvme_dev_disable() may come just after nvme_start_queues() in
> > the above reset of the 1st timeout. And nvme_timeout() won't
> > schedule a new reset since the controller state is NVME_CTRL_CONNECTING.
>
> Let me get this straight -- you're saying nvme_start_queues is going
> to somehow immediately trigger timeout work? I can't see how that could

It may be difficult to trigger in reality, but it won't be impossible,
since the timeout value can be adjusted via a module parameter, and any
scheduling delay can be introduced on a busy system. It isn't good
practice to rely on timing to avoid a race, IMO.

> possibly happen in real life, but we can just remove it and use the existing
> nvme_start_ctrl to handle that in the LIVE state.

OK, please also fix the other issues mentioned in the following comment;
then I will review further and see if it can work well:

https://marc.info/?l=linux-block&m=152668465710591&w=2

Thanks,
Ming
Re: [PATCH 16/34] iomap: add initial support for writes without buffer heads
On Mon, May 21, 2018 at 04:27:00PM -0700, Darrick J. Wong wrote:
> On Fri, May 18, 2018 at 06:48:12PM +0200, Christoph Hellwig wrote:
> > For now just limited to blocksize == PAGE_SIZE, where we can simply read
> > in the full page in write begin, and just set the whole page dirty after
> > copying data into it. This code is enabled by default and XFS will now
> > be fed pages without buffer heads in ->writepage and ->writepages.
> >
> > If a file system sets the IOMAP_F_BUFFER_HEAD flag on the iomap the old
> > path will still be used, this both helps the transition in XFS and
> > prepares for the gfs2 migration to the iomap infrastructure.
> >
> > Signed-off-by: Christoph Hellwig
> > ---
> >  fs/iomap.c            | 132 ++
> >  fs/xfs/xfs_iomap.c    |   6 +-
> >  include/linux/iomap.h |   2 +
> >  3 files changed, 127 insertions(+), 13 deletions(-)
> >
> > diff --git a/fs/iomap.c b/fs/iomap.c
> > index 821671af2618..cd4c563db80a 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -314,6 +314,58 @@ iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
> >  	truncate_pagecache_range(inode, max(pos, i_size), pos + len);
> >  }
> >
> > +static int
> > +iomap_read_page_sync(struct inode *inode, loff_t block_start, struct page *page,
> > +		unsigned poff, unsigned plen, struct iomap *iomap)
> > +{
> > +	struct bio_vec bvec;
> > +	struct bio bio;
> > +	int ret;
> > +
> > +	bio_init(&bio, &bvec, 1);
> > +	bio.bi_opf = REQ_OP_READ;
> > +	bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
> > +	bio_set_dev(&bio, iomap->bdev);
> > +	__bio_add_page(&bio, page, plen, poff);
> > +	ret = submit_bio_wait(&bio);
> > +	if (ret < 0 && iomap_block_needs_zeroing(inode, block_start, iomap))
> > +		zero_user(page, poff, plen);
> > +	return ret;
> > +}
> > +
> > +static int
> > +__iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
> > +		struct page *page, struct iomap *iomap)
> > +{
> > +	loff_t block_size = i_blocksize(inode);
> > +	loff_t block_start = pos & ~(block_size - 1);
> > +	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
> > +	unsigned poff = block_start & (PAGE_SIZE - 1);
> > +	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, block_end - block_start);
> > +	int status;
> > +
> > +	WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE);
> > +
> > +	if (PageUptodate(page))
> > +		return 0;
> > +
> > +	if (iomap_block_needs_zeroing(inode, block_start, iomap)) {
> > +		unsigned from = pos & (PAGE_SIZE - 1), to = from + len;
> > +		unsigned pend = poff + plen;
> > +
> > +		if (poff < from || pend > to)
> > +			zero_user_segments(page, poff, from, to, pend);
> > +	} else {
> > +		status = iomap_read_page_sync(inode, block_start, page,
> > +				poff, plen, iomap);
>
> Something doesn't smell right here. The only pages we need to read in
> are the first and last pages in the write_begin range, and only if they
> aren't page aligned and the underlying extent is IOMAP_MAPPED, right?

And not beyond EOF, too.

The bufferhead code handles this via the buffer_new() flag - it triggers
the skipping of read IO and the states in which it is set are clearly
indicated in iomap_to_bh(). That same logic needs to apply here.

> I also noticed that speculative preallocation kicks in by the second 80M
> write() call and writeback for the second call can successfully allocate
> the entire preallocation, which means that the third (or nth) write call
> can have a real extent already mapped in, and then we end up reading it.

Yeah, that's because there's no check against EOF here.

These writes are all beyond EOF, so there shouldn't be any read at all...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21, 2018 at 02:24:32PM -0400, Mike Snitzer wrote:
> Every single data structure change in this series should be reviewed for
> unforeseen alignment consequences. Jens seemed to say that is
> worthwhile. Not sure if he'll do it or we divide it up. If we divide
> it up a temp topic branch should be published for others to inspect.
>
> Could be alignment hasn't been a historic concern for a bunch of the
> data structures changed in this series.. if so then all we can do is fix
> up any obvious potential for false sharing.

Honestly, I almost never worry about alignment... the very few times I
do care, I use __cacheline_aligned_in_smp. If alignment is a concern in
any of those structs, there really ought to be a comment indicating it.

I very much doubt anything I touched was performance sensitive enough
for it to be an issue, though. And if there is a performance impact, it
should be outweighed by the reduced pointer chasing.

If you disagree, I don't mind leaving the device mapper patch out, it
really makes no difference to me. I could glance over for alignment
issues but I feel like my analysis would not be terribly valuable to you
considering I've already said my position on alignment is "meh, don't
care" :)
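For readers following along, the pointer-chasing point refers to embedding
the bio_set in the owning structure instead of allocating it separately; a
minimal before/after sketch with an invented driver struct, using the
bioset_init()/bioset_exit() API this series introduces:

	struct mydrv {				/* hypothetical */
		struct bio_set	bs;	/* embedded: one less pointer to chase */
		/* __cacheline_aligned_in_smp could go on hot members if
		 * profiling ever shows false sharing, per the above */
	};

	static int mydrv_init(struct mydrv *d)
	{
		/* was: d->bs = bioset_create(BIO_POOL_SIZE, 0,
		 *                            BIOSET_NEED_BVECS); */
		return bioset_init(&d->bs, BIO_POOL_SIZE, 0,
				   BIOSET_NEED_BVECS);
	}

	static void mydrv_exit(struct mydrv *d)
	{
		bioset_exit(&d->bs);	/* was: bioset_free(d->bs); */
	}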
Re: [RFC PATCH 0/3] blk-mq: Timeout rework
On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> The current blk-mq code potentially locks requests out of completion by
> the thousands, making drivers jump through hoops to handle them. This
> patch set allows drivers to complete their requests whenever they're
> completed, without requiring drivers to know anything about the timeout
> code, and with minimal synchronization.
>
> Other proposals under current consideration still have moments that
> prevent a driver from progressing a request to the completed state.
>
> The timeout is ultimately made safe by reference counting the request
> when timeout handling claims the request. By holding the reference count,
> we don't need to do any tricks to prevent a driver from completing the
> request out from under the timeout handler, allowing the actual state
> to be changed inline with the true state, and drivers don't need to be
> aware any of this is happening.
>
> In order to make the overhead as minimal as possible, the request's
> reference is taken only when it appears that actual timeout handling
> needs to be done.

Hello Keith,

Can you explain why the NVMe driver needs reference counting of requests
but no other block driver needs this? Additionally, why is it that for
all block drivers except NVMe the current block layer API is sufficient
(blk_get_request()/blk_execute_rq()/blk_mq_start_request()/
blk_mq_complete_request()/blk_mq_end_request())?

Thanks,

Bart.
Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence
On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> @@ -650,27 +600,10 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
>  void blk_mq_complete_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
> -	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> -	int srcu_idx;
>
>  	if (unlikely(blk_should_fake_timeout(q)))
>  		return;
> -
[ ... ]
> +	__blk_mq_complete_request(rq);
>  }
>  EXPORT_SYMBOL(blk_mq_complete_request);
[ ... ]
>  static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  {
>  	const struct blk_mq_ops *ops = req->q->mq_ops;
>  	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
>
> -	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
> -
>  	if (ops->timeout)
>  		ret = ops->timeout(req, reserved);
>
>  	switch (ret) {
>  	case BLK_EH_HANDLED:
> -		__blk_mq_complete_request(req);
> -		break;
> -	case BLK_EH_RESET_TIMER:
>  		/*
> -		 * As nothing prevents from completion happening while
> -		 * ->aborted_gstate is set, this may lead to ignored
> -		 * completions and further spurious timeouts.
> +		 * If the request is still in flight, the driver is requesting
> +		 * blk-mq complete it.
>  		 */
> -		blk_mq_rq_update_aborted_gstate(req, 0);
> +		if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT)
> +			__blk_mq_complete_request(req);
> +		break;
> +	case BLK_EH_RESET_TIMER:
>  		blk_add_timer(req);
>  		break;
>  	case BLK_EH_NOT_HANDLED:
> @@ -880,64 +782,64 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  	}
>  }

I think the above changes can lead to concurrent calls of
__blk_mq_complete_request() from the regular completion path and the
timeout path. That's wrong: the __blk_mq_complete_request() caller
should guarantee that no concurrent calls from another context to that
function can occur.

Bart.
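One way to provide that guarantee — shown here only as an illustration of
the concern, not as part of the posted patch — would be to let the two
paths race on a single atomic state transition, so exactly one of them
performs the completion:

	/*
	 * Illustration only: "rq->state" stands in for whatever field
	 * blk_mq_rq_state() reads in the posted series.
	 */
	if (cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) ==
	    MQ_RQ_IN_FLIGHT)
		__blk_mq_complete_request(rq);

Whichever caller loses the cmpxchg simply returns, so a regular completion
racing with the timeout path can no longer complete the request twice.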
Re: [PATCH 16/34] iomap: add initial support for writes without buffer heads
On Fri, May 18, 2018 at 06:48:12PM +0200, Christoph Hellwig wrote:
> For now just limited to blocksize == PAGE_SIZE, where we can simply read
> in the full page in write begin, and just set the whole page dirty after
> copying data into it. This code is enabled by default and XFS will now
> be fed pages without buffer heads in ->writepage and ->writepages.
>
> If a file system sets the IOMAP_F_BUFFER_HEAD flag on the iomap the old
> path will still be used, this both helps the transition in XFS and
> prepares for the gfs2 migration to the iomap infrastructure.
>
> Signed-off-by: Christoph Hellwig
> ---
>  fs/iomap.c            | 132 ++
>  fs/xfs/xfs_iomap.c    |   6 +-
>  include/linux/iomap.h |   2 +
>  3 files changed, 127 insertions(+), 13 deletions(-)
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 821671af2618..cd4c563db80a 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -314,6 +314,58 @@ iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>  	truncate_pagecache_range(inode, max(pos, i_size), pos + len);
>  }
>
> +static int
> +iomap_read_page_sync(struct inode *inode, loff_t block_start, struct page *page,
> +		unsigned poff, unsigned plen, struct iomap *iomap)
> +{
> +	struct bio_vec bvec;
> +	struct bio bio;
> +	int ret;
> +
> +	bio_init(&bio, &bvec, 1);
> +	bio.bi_opf = REQ_OP_READ;
> +	bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
> +	bio_set_dev(&bio, iomap->bdev);
> +	__bio_add_page(&bio, page, plen, poff);
> +	ret = submit_bio_wait(&bio);
> +	if (ret < 0 && iomap_block_needs_zeroing(inode, block_start, iomap))
> +		zero_user(page, poff, plen);
> +	return ret;
> +}
> +
> +static int
> +__iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
> +		struct page *page, struct iomap *iomap)
> +{
> +	loff_t block_size = i_blocksize(inode);
> +	loff_t block_start = pos & ~(block_size - 1);
> +	loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
> +	unsigned poff = block_start & (PAGE_SIZE - 1);
> +	unsigned plen = min_t(loff_t, PAGE_SIZE - poff, block_end - block_start);
> +	int status;
> +
> +	WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE);
> +
> +	if (PageUptodate(page))
> +		return 0;
> +
> +	if (iomap_block_needs_zeroing(inode, block_start, iomap)) {
> +		unsigned from = pos & (PAGE_SIZE - 1), to = from + len;
> +		unsigned pend = poff + plen;
> +
> +		if (poff < from || pend > to)
> +			zero_user_segments(page, poff, from, to, pend);
> +	} else {
> +		status = iomap_read_page_sync(inode, block_start, page,
> +				poff, plen, iomap);

Something doesn't smell right here. The only pages we need to read in
are the first and last pages in the write_begin range, and only if they
aren't page aligned and the underlying extent is IOMAP_MAPPED, right?

I also noticed that speculative preallocation kicks in by the second 80M
write() call and writeback for the second call can successfully allocate
the entire preallocation, which means that the third (or nth) write call
can have a real extent already mapped in, and then we end up reading it.

--D

> +		if (status < 0)
> +			return status;
> +		SetPageUptodate(page);
> +	}
> +
> +	return 0;
> +}
> +
>  static int
>  iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  		struct page **pagep, struct iomap *iomap)
> @@ -331,7 +383,10 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  	if (!page)
>  		return -ENOMEM;
>
> -	status = __block_write_begin_int(page, pos, len, NULL, iomap);
> +	if (iomap->flags & IOMAP_F_BUFFER_HEAD)
> +		status = __block_write_begin_int(page, pos, len, NULL, iomap);
> +	else
> +		status = __iomap_write_begin(inode, pos, len, page, iomap);
>  	if (unlikely(status)) {
>  		unlock_page(page);
>  		put_page(page);
> @@ -344,14 +399,63 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  	return status;
>  }
>
> +int
> +iomap_set_page_dirty(struct page *page)
> +{
> +	struct address_space *mapping = page_mapping(page);
> +	int newly_dirty;
> +
> +	if (unlikely(!mapping))
> +		return !TestSetPageDirty(page);
> +
> +	/*
> +	 * Lock out page->mem_cgroup migration to keep PageDirty
> +	 * synchronized with per-memcg dirty page counters.
> +	 */
> +	lock_page_memcg(page);
> +	newly_dirty = !TestSetPageDirty(page);
> +	if (newly_dirty)
> +		__set_page_dirty(page, mapping, 0);
> +	unlock_page_memcg(page);
> +
> +	if (newly_dirty)
> +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +
[RFC PATCH 0/3] blk-mq: Timeout rework
The current blk-mq code potentially locks requests out of completion by
the thousands, making drivers jump through hoops to handle them. This
patch set allows drivers to complete their requests whenever they're
completed, without requiring drivers to know anything about the timeout
code, and with minimal synchronization.

Other proposals under current consideration still have moments that
prevent a driver from progressing a request to the completed state.

The timeout is ultimately made safe by reference counting the request
when timeout handling claims the request. By holding the reference count,
we don't need to do any tricks to prevent a driver from completing the
request out from under the timeout handler, allowing the actual state
to be changed inline with the true state, and drivers don't need to be
aware any of this is happening.

In order to make the overhead as minimal as possible, the request's
reference is taken only when it appears that actual timeout handling
needs to be done.

Keith Busch (3):
  blk-mq: Reference count request usage
  blk-mq: Fix timeout and state order
  blk-mq: Remove generation sequence

 block/blk-core.c       |   6 -
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c         | 291 +
 block/blk-mq.h         |  20 +---
 block/blk-timeout.c    |   1 -
 include/linux/blkdev.h |  26 +
 6 files changed, 83 insertions(+), 262 deletions(-)

--
2.14.3
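"Taken only when it appears that actual timeout handling needs to be done"
presumably lands in the tag iterator, along these lines — a sketch
reconstructed from the series, with the expiry predicate left as an
assumed helper:

	static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
			struct request *rq, void *priv, bool reserved)
	{
		if (!blk_mq_req_expired(rq, priv))	/* cheap check first */
			return;
		/*
		 * Claim a reference so the tag cannot be reallocated while
		 * the timeout handler looks at the request; a refcount that
		 * already hit zero means the request is being freed and
		 * needs no timeout handling.
		 */
		if (!kref_get_unless_zero(&rq->ref))
			return;
		blk_mq_rq_timed_out(rq, reserved);
		kref_put(&rq->ref, blk_mq_exit_request);
	}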
[RFC PATCH 2/3] blk-mq: Fix timeout and state order
The block layer had been setting the state to in-flight prior to updating
the timer. This is the wrong order since the timeout handler could observe
the in-flight state with the older timeout, believing the request had
expired when in fact it is just getting started.

Signed-off-by: Keith Busch
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8b370ed75605..66e5c768803f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -713,8 +713,8 @@ void blk_mq_start_request(struct request *rq)
 	preempt_disable();
 	write_seqcount_begin(&rq->gstate_seq);
 
-	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
 	blk_add_timer(rq);
+	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
 
 	write_seqcount_end(&rq->gstate_seq);
 	preempt_enable();
--
2.14.3
[RFC PATCH 3/3] blk-mq: Remove generation sequence
This patch simplifies the timeout handling by relying on the request
reference counting to ensure the iterator is operating on an inflight
and truly timed out request. Since the reference counting prevents the
tag from being reallocated, the block layer no longer needs to prevent
drivers from completing their requests while the timeout handler is
operating on it: a driver completing a request is allowed to proceed to
the next state without additional synchronization with the block layer.

This also removes any need for generation sequence numbers since the
request lifetime is prevented from being reallocated as a new sequence
while timeout handling is operating on it.

Signed-off-by: Keith Busch
---
 block/blk-core.c       |   6 --
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c         | 259 ++---
 block/blk-mq.h         |  20 +---
 block/blk-timeout.c    |   1 -
 include/linux/blkdev.h |  26 +
 6 files changed, 58 insertions(+), 255 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 43370faee935..cee03cad99f2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->internal_tag = -1;
 	rq->start_time_ns = ktime_get_ns();
 	rq->part = NULL;
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
-	/*
-	 * See comment of blk_mq_init_request
-	 */
-	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3080e18cb859..ffa622366922 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -344,7 +344,6 @@ static const char *const rqf_name[] = {
 	RQF_NAME(STATS),
 	RQF_NAME(SPECIAL_PAYLOAD),
 	RQF_NAME(ZONE_WRITE_LOCKED),
-	RQF_NAME(MQ_TIMEOUT_EXPIRED),
 	RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 66e5c768803f..4858876fd364 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -589,56 +589,6 @@ static void __blk_mq_complete_request(struct request *rq)
 	put_cpu();
 }
 
-static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
-	__releases(hctx->srcu)
-{
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-		rcu_read_unlock();
-	else
-		srcu_read_unlock(hctx->srcu, srcu_idx);
-}
-
-static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
-	__acquires(hctx->srcu)
-{
-	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
-		/* shut up gcc false positive */
-		*srcu_idx = 0;
-		rcu_read_lock();
-	} else
-		*srcu_idx = srcu_read_lock(hctx->srcu);
-}
-
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
-{
-	unsigned long flags;
-
-	/*
-	 * blk_mq_rq_aborted_gstate() is used from the completion path and
-	 * can thus be called from irq context. u64_stats_fetch in the
-	 * middle of update on the same CPU leads to lockup. Disable irq
-	 * while updating.
-	 */
-	local_irq_save(flags);
-	u64_stats_update_begin(&rq->aborted_gstate_sync);
-	rq->aborted_gstate = gstate;
-	u64_stats_update_end(&rq->aborted_gstate_sync);
-	local_irq_restore(flags);
-}
-
-static u64 blk_mq_rq_aborted_gstate(struct request *rq)
-{
-	unsigned int start;
-	u64 aborted_gstate;
-
-	do {
-		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
-		aborted_gstate = rq->aborted_gstate;
-	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
-
-	return aborted_gstate;
-}
-
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:	the request being processed
@@ -650,27 +600,10 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
-	int srcu_idx;
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
-
-	/*
-	 * If @rq->aborted_gstate equals the current instance, timeout is
-	 * claiming @rq and we lost. This is synchronized through
-	 * hctx_lock(). See blk_mq_timeout_work() for details.
-	 *
-	 * Completion path never blocks and we can directly use RCU here
-	 * instead of hctx_lock() which can be either RCU or SRCU.
-	 * However, that would complicate paths which want to synchronize
-	 * against us. Let stay in sync with the issue path so that
-	 * hctx_lock() covers both issue and completion paths.
-	 */
-	hctx_lock(hctx, &srcu_idx);
-	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
-		__blk_mq_complete_request(rq);
-	hctx_unlock(hctx, srcu_idx);
+
[RFC PATCH 1/3] blk-mq: Reference count request usage
This patch adds a struct kref to struct request so that request users
can be sure they're operating on the same request without it changing
while they're processing it. The request's tag won't be released for
reuse until the last user is done with it.

Signed-off-by: Keith Busch
---
 block/blk-mq.c         | 30 +++---
 include/linux/blkdev.h |  2 ++
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4cbfd784e837..8b370ed75605 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -332,6 +332,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 #endif
 
 	data->ctx->rq_dispatched[op_is_sync(op)]++;
+	kref_init(&rq->ref);
 	return rq;
 }
 
@@ -465,13 +466,33 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
+static void blk_mq_exit_request(struct kref *ref)
+{
+	struct request *rq = container_of(ref, struct request, ref);
+	struct request_queue *q = rq->q;
+	struct blk_mq_ctx *ctx = rq->mq_ctx;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+	const int sched_tag = rq->internal_tag;
+
+	if (rq->tag != -1)
+		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
+	if (sched_tag != -1)
+		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
+	blk_mq_sched_restart(hctx);
+	blk_queue_exit(q);
+}
+
+static void blk_mq_put_request(struct request *rq)
+{
+	kref_put(&rq->ref, blk_mq_exit_request);
+}
+
 void blk_mq_free_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 	struct elevator_queue *e = q->elevator;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-	const int sched_tag = rq->internal_tag;
 
 	if (rq->rq_flags & RQF_ELVPRIV) {
 		if (e && e->type->ops.mq.finish_request)
@@ -495,12 +516,7 @@ void blk_mq_free_request(struct request *rq)
 		blk_put_rl(blk_rq_rl(rq));
 
 	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
-	if (rq->tag != -1)
-		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
-	if (sched_tag != -1)
-		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
-	blk_mq_sched_restart(hctx);
-	blk_queue_exit(q);
+	blk_mq_put_request(rq);
 }
 EXPORT_SYMBOL_GPL(blk_mq_free_request);

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f3999719f828..26bf2c1e3502 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -257,6 +257,8 @@ struct request {
 	struct u64_stats_sync aborted_gstate_sync;
 	u64 aborted_gstate;
 
+	struct kref ref;
+
 	/* access through blk_rq_set_deadline, blk_rq_deadline */
 	unsigned long __deadline;
--
2.14.3
Re: INFO: task hung in blk_queue_enter
Bart Van Assche wrote:
> On Wed, 2018-05-16 at 17:16 +0200, Dmitry Vyukov wrote:
> > On Wed, May 16, 2018 at 4:56 PM, Bart Van Assche wrote:
> > > On Wed, 2018-05-16 at 22:05 +0900, Tetsuo Handa wrote:
> > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > index 85909b4..59e2496 100644
> > > > --- a/block/blk-core.c
> > > > +++ b/block/blk-core.c
> > > > @@ -951,10 +951,10 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> > > >  			smp_rmb();
> > > >
> > > >  			wait_event(q->mq_freeze_wq,
> > > > -				   (atomic_read(&q->mq_freeze_depth) == 0 &&
> > > > -				    (preempt || !blk_queue_preempt_only(q))) ||
> > > > +				   atomic_read(&q->mq_freeze_depth) ||
> > > > +				   (preempt || !blk_queue_preempt_only(q)) ||
> > > >  				   blk_queue_dying(q));
> > > > -			if (blk_queue_dying(q))
> > > > +			if (atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q))
> > > >  				return -ENODEV;
> > > >  		}
> > > >  	}
> > >
> > > That change looks wrong to me.
> >
> > Hi Bart,
> >
> > Why does it look wrong to you?
>
> Because that change conflicts with the purpose of queue freezing and also
> because that change would inject I/O errors in code paths that shouldn't
> inject I/O errors.

But waiting there until atomic_read(&q->mq_freeze_depth) becomes 0 is
causing a deadlock; a wait_event() that never returns is a bug. I think
we should not wait on q->mq_freeze_depth.
Re: [PATCH v5 4/5] fs: blkdev set bio prio from kiocb prio
On 5/21/18 2:04 PM, Jeff Moyer wrote:
> adam.manzana...@wdc.com writes:
>
>> From: Adam Manzanares
>>
>> Now that kiocb has an ioprio field copy this over to the bio when it is
>> created from the kiocb.
>>
>> Signed-off-by: Adam Manzanares
>> ---
>>  fs/block_dev.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 7ec920e27065..da1e94d2bb75 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -355,6 +355,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>  		bio->bi_write_hint = iocb->ki_hint;
>>  		bio->bi_private = dio;
>>  		bio->bi_end_io = blkdev_bio_end_io;
>> +		bio->bi_ioprio = iocb->ki_ioprio;
>>
>>  		ret = bio_iov_iter_get_pages(bio, iter);
>>  		if (unlikely(ret)) {
>
> Forgot to mention, you should also convert __blkdev_direct_IO_simple.

NP.

>
> -Jeff
>
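For reference, the simple path builds an on-stack bio, so the conversion
Jeff asks for should amount to the same one-liner next to the existing
hint copy — a guess at what the next revision will do, not a posted patch:

	/* in __blkdev_direct_IO_simple(): */
	bio.bi_write_hint = iocb->ki_hint;
	bio.bi_ioprio = iocb->ki_ioprio;	/* proposed addition */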
Re: [PATCH v5 0/5] AIO add per-command iopriority
On 5/21/18 1:57 PM, Jeff Moyer wrote:
> Hi, Adam,
>
> adam.manzana...@wdc.com writes:
>
>> From: Adam Manzanares
>>
>> This is the per-I/O equivalent of the ioprio_set system call.
>> See the following link for performance implications on a SATA HDD:
>> https://lkml.org/lkml/2016/12/6/495
>>
>> First patch factors ioprio_check_cap function out of ioprio_set system
>> call to also be used by the aio ioprio interface.
>>
>> Second patch converts kiocb ki_hint field to a u16 to avoid kiocb bloat.
>>
>> Third patch passes ioprio hint from aio iocb to kiocb and initializes
>> kiocb ioprio value appropriately when it is not explicitly set.
>>
>> Fourth patch enables the feature for blkdev.
>>
>> Fifth patch enables the feature for iomap direct IO.
>>
>> Note: this work is based on top of linux-vfs/for-next
>
> I'll cook up a libaio test case. Can you put together a man-pages
> update for this?

Hello Jeff,

I'll send out a man-pages update soon.

Thanks,
Adam

>
> Thanks!
> Jeff
>
Re: [PATCH v5 4/5] fs: blkdev set bio prio from kiocb prio
adam.manzana...@wdc.com writes:

> From: Adam Manzanares
>
> Now that kiocb has an ioprio field copy this over to the bio when it is
> created from the kiocb.
>
> Signed-off-by: Adam Manzanares
> ---
>  fs/block_dev.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 7ec920e27065..da1e94d2bb75 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -355,6 +355,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  		bio->bi_write_hint = iocb->ki_hint;
>  		bio->bi_private = dio;
>  		bio->bi_end_io = blkdev_bio_end_io;
> +		bio->bi_ioprio = iocb->ki_ioprio;
>
>  		ret = bio_iov_iter_get_pages(bio, iter);
>  		if (unlikely(ret)) {

Forgot to mention, you should also convert __blkdev_direct_IO_simple.

-Jeff
Re: [PATCH v5 1/5] block: add ioprio_check_cap function
adam.manzana...@wdc.com writes:

> From: Adam Manzanares
>
> Aio per command iopriority support introduces a second interface between
> userland and the kernel capable of passing iopriority. The aio interface
> also needs the ability to verify that the submitting context has
> sufficient privileges to submit IOPRIO_RT commands. This patch creates
> the ioprio_check_cap function to be used by the ioprio_set system call
> and also by the aio interface.
>
> Signed-off-by: Adam Manzanares
> Reviewed-by: Christoph Hellwig

Reviewed-by: Jeff Moyer

> ---
>  block/ioprio.c         | 22 --
>  include/linux/ioprio.h |  2 ++
>  2 files changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/block/ioprio.c b/block/ioprio.c
> index 6f5d0b6625e3..f9821080c92c 100644
> --- a/block/ioprio.c
> +++ b/block/ioprio.c
> @@ -61,15 +61,10 @@ int set_task_ioprio(struct task_struct *task, int ioprio)
>  }
>  EXPORT_SYMBOL_GPL(set_task_ioprio);
>
> -SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
> +int ioprio_check_cap(int ioprio)
>  {
>  	int class = IOPRIO_PRIO_CLASS(ioprio);
>  	int data = IOPRIO_PRIO_DATA(ioprio);
> -	struct task_struct *p, *g;
> -	struct user_struct *user;
> -	struct pid *pgrp;
> -	kuid_t uid;
> -	int ret;
>
>  	switch (class) {
>  	case IOPRIO_CLASS_RT:
> @@ -92,6 +87,21 @@ SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
>  		return -EINVAL;
>  	}
>
> +	return 0;
> +}
> +
> +SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
> +{
> +	struct task_struct *p, *g;
> +	struct user_struct *user;
> +	struct pid *pgrp;
> +	kuid_t uid;
> +	int ret;
> +
> +	ret = ioprio_check_cap(ioprio);
> +	if (ret)
> +		return ret;
> +
>  	ret = -ESRCH;
>  	rcu_read_lock();
>  	switch (which) {
> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
> index 627efac73e6d..4a28cec49ec3 100644
> --- a/include/linux/ioprio.h
> +++ b/include/linux/ioprio.h
> @@ -77,4 +77,6 @@ extern int ioprio_best(unsigned short aprio, unsigned short bprio);
>
>  extern int set_task_ioprio(struct task_struct *task, int ioprio);
>
> +extern int ioprio_check_cap(int ioprio);
> +
>  #endif
Re: [PATCH v5 5/5] fs: iomap dio set bio prio from kiocb prio
adam.manzana...@wdc.com writes:

> From: Adam Manzanares
>
> Now that kiocb has an ioprio field copy this over to the bio when it is
> created from the kiocb during direct IO.
>
> Signed-off-by: Adam Manzanares

Reviewed-by: Jeff Moyer

> ---
>  fs/iomap.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index afd163586aa0..65aae194aeca 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -919,6 +919,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		bio->bi_iter.bi_sector =
>  			(iomap->addr + pos - iomap->offset) >> 9;
>  		bio->bi_write_hint = dio->iocb->ki_hint;
> +		bio->bi_ioprio = dio->iocb->ki_ioprio;
>  		bio->bi_private = dio;
>  		bio->bi_end_io = iomap_dio_bio_end_io;
Re: [PATCH v5 4/5] fs: blkdev set bio prio from kiocb prio
adam.manzana...@wdc.com writes:

> From: Adam Manzanares
>
> Now that kiocb has an ioprio field copy this over to the bio when it is
> created from the kiocb.
>
> Signed-off-by: Adam Manzanares

Reviewed-by: Jeff Moyer

> ---
>  fs/block_dev.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 7ec920e27065..da1e94d2bb75 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -355,6 +355,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  		bio->bi_write_hint = iocb->ki_hint;
>  		bio->bi_private = dio;
>  		bio->bi_end_io = blkdev_bio_end_io;
> +		bio->bi_ioprio = iocb->ki_ioprio;
>
>  		ret = bio_iov_iter_get_pages(bio, iter);
>  		if (unlikely(ret)) {
Re: [PATCH v5 3/5] fs: Add aio iopriority support
adam.manzana...@wdc.com writes:

> From: Adam Manzanares
>
> This is the per-I/O equivalent of the ioprio_set system call.
>
> When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the
> newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field.
>
> We set the blkdev bio iopriority unconditionally, so we need to guarantee
> the kiocb is initialized properly. Added changes to the loopback driver
> and init_sync_kiocb to achieve this.
>
> This patch depends on block: add ioprio_check_cap function.
>
> Signed-off-by: Adam Manzanares

Reviewed-by: Jeff Moyer

> ---
>  drivers/block/loop.c         |  3 +++
>  fs/aio.c                     | 16
>  include/linux/fs.h           |  3 +++
>  include/uapi/linux/aio_abi.h |  1 +
>  4 files changed, 23 insertions(+)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 5d4e31655d96..dd98dfd97f5e 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -76,6 +76,8 @@
>  #include
>  #include
>  #include
> +#include <linux/ioprio.h>
> +
>  #include "loop.h"
>
>  #include
> @@ -559,6 +561,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
>  	cmd->iocb.ki_filp = file;
>  	cmd->iocb.ki_complete = lo_rw_aio_complete;
>  	cmd->iocb.ki_flags = IOCB_DIRECT;
> +	cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
>  	if (cmd->css)
>  		kthread_associate_blkcg(cmd->css);
>
> diff --git a/fs/aio.c b/fs/aio.c
> index f3eae5d5771b..44b4572be524 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1451,6 +1451,22 @@ static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
>  	if (iocb->aio_flags & IOCB_FLAG_RESFD)
>  		req->ki_flags |= IOCB_EVENTFD;
>  	req->ki_hint = file_write_hint(req->ki_filp);
> +	if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
> +		/*
> +		 * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then
> +		 * aio_reqprio is interpreted as an I/O scheduling
> +		 * class and priority.
> +		 */
> +		ret = ioprio_check_cap(iocb->aio_reqprio);
> +		if (ret) {
> +			pr_debug("aio ioprio check cap error\n");
> +			return -EINVAL;
> +		}
> +
> +		req->ki_ioprio = iocb->aio_reqprio;
> +	} else
> +		req->ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
> +
>  	ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags);
>  	if (unlikely(ret))
>  		fput(req->ki_filp);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 50de40dbbb85..73b749ed3ea1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -36,6 +36,7 @@
>  #include
>  #include
>  #include
> +#include <linux/ioprio.h>
>
>  #include
>  #include
> @@ -302,6 +303,7 @@ struct kiocb {
>  	void			*private;
>  	int			ki_flags;
>  	u16			ki_hint;
> +	u16			ki_ioprio; /* See linux/ioprio.h */
>  } __randomize_layout;
>
>  static inline bool is_sync_kiocb(struct kiocb *kiocb)
> @@ -1942,6 +1944,7 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
>  		.ki_filp = filp,
>  		.ki_flags = iocb_flags(filp),
>  		.ki_hint = ki_hint_validate(file_write_hint(filp)),
> +		.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0),
>  	};
>  }
>
> diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
> index 2c0a3415beee..d4e768d55d14 100644
> --- a/include/uapi/linux/aio_abi.h
> +++ b/include/uapi/linux/aio_abi.h
> @@ -55,6 +55,7 @@ enum {
>   * is valid.
>   */
>  #define IOCB_FLAG_RESFD		(1 << 0)
> +#define IOCB_FLAG_IOPRIO	(1 << 1)
>
>  /* read() from /dev/aio returns these structures. */
>  struct io_event {
Re: [PATCH v5 0/5] AIO add per-command iopriority
Hi, Adam,

adam.manzana...@wdc.com writes:

> From: Adam Manzanares
>
> This is the per-I/O equivalent of the ioprio_set system call.
> See the following link for performance implications on a SATA HDD:
> https://lkml.org/lkml/2016/12/6/495
>
> First patch factors ioprio_check_cap function out of ioprio_set system
> call to also be used by the aio ioprio interface.
>
> Second patch converts kiocb ki_hint field to a u16 to avoid kiocb bloat.
>
> Third patch passes ioprio hint from aio iocb to kiocb and initializes
> kiocb ioprio value appropriately when it is not explicitly set.
>
> Fourth patch enables the feature for blkdev.
>
> Fifth patch enables the feature for iomap direct IO.
>
> Note: this work is based on top of linux-vfs/for-next

I'll cook up a libaio test case. Can you put together a man-pages
update for this?

Thanks!
Jeff
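A test along those lines could also drive the raw ABI directly instead of
libaio; a hypothetical sketch follows (error handling omitted, device path
and RT class chosen arbitrarily; the IOPRIO_* values are restated locally
because no uapi header exports them, and IOCB_FLAG_IOPRIO is defined as a
fallback for headers predating this series):

	#define _GNU_SOURCE
	#include <linux/aio_abi.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <fcntl.h>
	#include <string.h>

	#ifndef IOCB_FLAG_IOPRIO
	#define IOCB_FLAG_IOPRIO	(1 << 1)	/* from this series */
	#endif
	#define IOPRIO_CLASS_SHIFT	13
	#define IOPRIO_PRIO_VALUE(cl, d) (((cl) << IOPRIO_CLASS_SHIFT) | (d))
	#define IOPRIO_CLASS_RT		1

	int main(void)
	{
		aio_context_t ctx = 0;
		struct iocb cb, *cbs[1] = { &cb };
		struct io_event ev;
		static char buf[4096] __attribute__((aligned(4096)));
		int fd = open("/dev/sda", O_RDONLY | O_DIRECT);

		syscall(__NR_io_setup, 1, &ctx);

		memset(&cb, 0, sizeof(cb));
		cb.aio_lio_opcode = IOCB_CMD_PREAD;
		cb.aio_fildes = fd;
		cb.aio_buf = (unsigned long)buf;
		cb.aio_nbytes = sizeof(buf);
		/* the new bits: per-command priority; the RT class
		 * requires CAP_SYS_ADMIN, per ioprio_check_cap() */
		cb.aio_flags = IOCB_FLAG_IOPRIO;
		cb.aio_reqprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 4);

		syscall(__NR_io_submit, ctx, 1, cbs);
		syscall(__NR_io_getevents, ctx, 1, 1, &ev, NULL);
		syscall(__NR_io_destroy, ctx);
		return ev.res == sizeof(buf) ? 0 : 1;
	}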
Re: buffered I/O without buffer heads in xfs and iomap v2
On Fri, May 18, 2018 at 06:47:56PM +0200, Christoph Hellwig wrote:
> Hi all,
>
> this series adds support for buffered I/O without buffer heads to
> the iomap and XFS code.
>
> For now this series only contains support for block size == PAGE_SIZE,
> with the 4k support split into a separate series.
>
>
> A git tree is available at:
>
>     git://git.infradead.org/users/hch/xfs.git xfs-iomap-read.2
>
> Gitweb:
>
>     http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-iomap-read.2

Hmm, so I pulled this and ran my trivial stupid benchmark on for-next.
It's a stupid VM with 2G of RAM and a 12GB virtio-scsi disk backed by
tmpfs:

# mkfs.xfs -f -m rmapbt=0,reflink=1 /dev/sda
meta-data=/dev/sda               isize=512    agcount=4, agsize=823296 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1
data     =                       bsize=4096   blocks=3293184, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=3693, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
# mount /dev/sda /mnt
# xfs_io -f -c 'pwrite -W -S 0x64 -b 83886080 0 734003200' /mnt/a
wrote 734003200/734003200 bytes at offset 0
700 MiB, 9 ops; 0:00:01.06 (655.500 MiB/sec and 8.4279 ops/sec)
# cp --reflink=always /mnt/a /mnt/b
# xfs_io -f -c 'pwrite -W -S 0x65 -b 83886080 0 734003200' /mnt/b
wrote 734003200/734003200 bytes at offset 0
700 MiB, 9 ops; 0.9620 sec (727.615 MiB/sec and 9.3551 ops/sec)

Then I applied your series (not including the blocksize < pagesize
series) and saw this big regression:

# mkfs.xfs -f -m rmapbt=0,reflink=1 /dev/sda
meta-data=/dev/sda               isize=512    agcount=4, agsize=823296 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1
data     =                       bsize=4096   blocks=3293184, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=3693, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
# mount /dev/sda /mnt
# xfs_io -f -c 'pwrite -W -S 0x64 -b 83886080 0 734003200' /mnt/a
wrote 734003200/734003200 bytes at offset 0
700 MiB, 9 ops; 0:00:08.04 (87.031 MiB/sec and 1.1190 ops/sec)
# cp --reflink=always /mnt/a /mnt/b
# xfs_io -f -c 'pwrite -W -S 0x65 -b 83886080 0 734003200' /mnt/b
wrote 734003200/734003200 bytes at offset 0
700 MiB, 9 ops; 0:00:21.61 (32.389 MiB/sec and 0.4164 ops/sec)

I'll see if I can spot the problem while I read through the v2 code...

--D

> Changes since v1:
>  - fix the iomap_readpages error handling
>  - use unsigned file offsets in a few places to avoid arithmetic overflows
>  - allocate a iomap_page in iomap_page_mkwrite to fix generic/095
>  - improve a few comments
>  - add more asserts
>  - warn about truncated block numbers from ->bmap
>  - new patch to change the __do_page_cache_readahead return value to
>    unsigned int
>  - remove an incorrectly added empty line
>  - make inline data an explicit iomap type instead of a flag
>  - add a IOMAP_F_BUFFER_HEAD flag to force use of buffer heads for gfs2,
>    and keep the basic buffer head infrastructure around for now.
Re: [PATCH 05/34] fs: use ->is_partially_uptodate in page_cache_seek_hole_data
On Fri, May 18, 2018 at 06:48:01PM +0200, Christoph Hellwig wrote: > This way the implementation doesn't depend on buffer_head internals. > > Signed-off-by: Christoph Hellwig> --- > fs/iomap.c | 83 +++--- > 1 file changed, 42 insertions(+), 41 deletions(-) > > diff --git a/fs/iomap.c b/fs/iomap.c > index bef5e91d40bf..0fecd5789d7b 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -594,31 +594,54 @@ EXPORT_SYMBOL_GPL(iomap_fiemap); > * > * Returns the offset within the file on success, and -ENOENT otherwise. This comment is now wrong, since we return the offset via *lastoff and I think the return value is whether or not we found what we were looking for...? > */ > -static loff_t > -page_seek_hole_data(struct page *page, loff_t lastoff, int whence) > +static bool > +page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff, > + int whence) > { > - loff_t offset = page_offset(page); > - struct buffer_head *bh, *head; > + const struct address_space_operations *ops = inode->i_mapping->a_ops; > + unsigned int bsize = i_blocksize(inode), off; > bool seek_data = whence == SEEK_DATA; > + loff_t poff = page_offset(page); > > - if (lastoff < offset) > - lastoff = offset; > - > - bh = head = page_buffers(page); > - do { > - offset += bh->b_size; > - if (lastoff >= offset) > - continue; > + if (WARN_ON_ONCE(*lastoff >= poff + PAGE_SIZE)) > + return false; > > + if (*lastoff < poff) { > /* > - * Any buffer with valid data in it should have BH_Uptodate set. > + * Last offset smaller than the start of the page means we found > + * a hole: >*/ > - if (buffer_uptodate(bh) == seek_data) > - return lastoff; > + if (whence == SEEK_HOLE) > + return true; > + *lastoff = poff; > + } > > - lastoff = offset; > - } while ((bh = bh->b_this_page) != head); > - return -ENOENT; > + /* > + * Just check the page unless we can and should check block ranges: > + */ > + if (bsize == PAGE_SIZE || !ops->is_partially_uptodate) { > + if (PageUptodate(page) == seek_data) > + return true; > + return false; return PageUptodate(page) == seek_data; ? --D > + } > + > + lock_page(page); > + if (unlikely(page->mapping != inode->i_mapping)) > + goto out_unlock_not_found; > + > + for (off = 0; off < PAGE_SIZE; off += bsize) { > + if ((*lastoff & ~PAGE_MASK) >= off + bsize) > + continue; > + if (ops->is_partially_uptodate(page, off, bsize) == seek_data) { > + unlock_page(page); > + return true; > + } > + *lastoff = poff + off + bsize; > + } > + > +out_unlock_not_found: > + unlock_page(page); > + return false; > } > > /* > @@ -655,30 +678,8 @@ page_cache_seek_hole_data(struct inode *inode, loff_t > offset, loff_t length, > for (i = 0; i < nr_pages; i++) { > struct page *page = pvec.pages[i]; > > - /* > - * At this point, the page may be truncated or > - * invalidated (changing page->mapping to NULL), or > - * even swizzled back from swapper_space to tmpfs file > - * mapping. However, page->index will not change > - * because we have a reference on the page. > - * > - * If current page offset is beyond where we've ended, > - * we've found a hole. 
> - */ > - if (whence == SEEK_HOLE && > - lastoff < page_offset(page)) > + if (page_seek_hole_data(inode, page, &lastoff, whence)) > goto check_range; > - > - lock_page(page); > - if (likely(page->mapping == inode->i_mapping) && > - page_has_buffers(page)) { > - lastoff = page_seek_hole_data(page, lastoff, > whence); > - if (lastoff >= 0) { > - unlock_page(page); > - goto check_range; > - } > - } > - unlock_page(page); > lastoff = page_offset(page) + PAGE_SIZE; > } > pagevec_release(&pvec); > -- > 2.17.0
Re: [PATCH 00/10] Misc block layer patches for bcachefs
On Mon, 2018-05-21 at 11:37 -0700, Omar Sandoval wrote: > Have you made any progress in porting srp-test to blktests so we don't > have to have this conversation again? Hello Omar, Porting the srp-test software to the blktests framework is still high on my to-do list. I will start working on this as soon as the work on adding SMR support for fio has been completed. Bart.
Re: [PATCH 11/12] xfs: convert to bioset_init()/mempool_init()
On Sun, May 20, 2018 at 06:25:57PM -0400, Kent Overstreet wrote: > Signed-off-by: Kent Overstreet Looks ok, I guess... Acked-by: Darrick J. Wong --D > --- > fs/xfs/xfs_aops.c | 2 +- > fs/xfs/xfs_aops.h | 2 +- > fs/xfs/xfs_super.c | 11 +-- > 3 files changed, 7 insertions(+), 8 deletions(-) > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index 0ab824f574..102463543d 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -594,7 +594,7 @@ xfs_alloc_ioend( > struct xfs_ioend *ioend; > struct bio *bio; > > - bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, xfs_ioend_bioset); > + bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, &xfs_ioend_bioset); > xfs_init_bio_from_bh(bio, bh); > > ioend = container_of(bio, struct xfs_ioend, io_inline_bio); > diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h > index 69346d460d..694c85b038 100644 > --- a/fs/xfs/xfs_aops.h > +++ b/fs/xfs/xfs_aops.h > @@ -18,7 +18,7 @@ > #ifndef __XFS_AOPS_H__ > #define __XFS_AOPS_H__ > > -extern struct bio_set *xfs_ioend_bioset; > +extern struct bio_set xfs_ioend_bioset; > > /* > * Types of I/O for bmap clustering and I/O completion tracking. > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index d714240529..f643d76db5 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -63,7 +63,7 @@ > #include > > static const struct super_operations xfs_super_operations; > -struct bio_set *xfs_ioend_bioset; > +struct bio_set xfs_ioend_bioset; > > static struct kset *xfs_kset; /* top-level xfs sysfs dir */ > #ifdef DEBUG > @@ -1845,10 +1845,9 @@ MODULE_ALIAS_FS("xfs"); > STATIC int __init > xfs_init_zones(void) > { > - xfs_ioend_bioset = bioset_create(4 * MAX_BUF_PER_PAGE, > + if (bioset_init(&xfs_ioend_bioset, 4 * MAX_BUF_PER_PAGE, > offsetof(struct xfs_ioend, io_inline_bio), > - BIOSET_NEED_BVECS); > - if (!xfs_ioend_bioset) > + BIOSET_NEED_BVECS)) > goto out; > > xfs_log_ticket_zone = kmem_zone_init(sizeof(xlog_ticket_t), > @@ -1997,7 +1996,7 @@ xfs_init_zones(void) > out_destroy_log_ticket_zone: > kmem_zone_destroy(xfs_log_ticket_zone); > out_free_ioend_bioset: > - bioset_free(xfs_ioend_bioset); > + bioset_exit(&xfs_ioend_bioset); > out: > return -ENOMEM; > } > @@ -2029,7 +2028,7 @@ xfs_destroy_zones(void) > kmem_zone_destroy(xfs_btree_cur_zone); > kmem_zone_destroy(xfs_bmap_free_item_zone); > kmem_zone_destroy(xfs_log_ticket_zone); > - bioset_free(xfs_ioend_bioset); > + bioset_exit(&xfs_ioend_bioset); > } > > STATIC int __init > -- > 2.17.0
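The conversion idiom is the same at every call site in the series: the bio_set moves from behind a pointer into the owning structure, and setup/teardown switch to the int-returning bioset_init()/bioset_exit() pair. A minimal sketch, assuming a hypothetical driver struct (my_dev and its functions are illustrative, not from the patch):

#include <linux/bio.h>

/* Hypothetical owner: the bio_set is embedded, so allocation fast paths
 * reach it by constant offset instead of chasing a pointer. */
struct my_dev {
	struct bio_set bs;	/* was: struct bio_set *bs; */
};

static int my_dev_init(struct my_dev *dev)
{
	/* bioset_init() reports failure by return value, not NULL. */
	return bioset_init(&dev->bs, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
}

static struct bio *my_dev_alloc_bio(struct my_dev *dev)
{
	return bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, &dev->bs);
}

static void my_dev_exit(struct my_dev *dev)
{
	/* bioset_exit() replaces bioset_free(). */
	bioset_exit(&dev->bs);
}

Running pahole on the owning struct after such a conversion is the quick way to confirm that embedding the set did not push a hot member across a cacheline boundary.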
Re: [PATCH 00/10] Misc block layer patches for bcachefs
On Mon, May 21, 2018 at 03:11:08PM +, Bart Van Assche wrote: > On Sun, 2018-05-20 at 19:58 -0400, Kent Overstreet wrote: > > On Sun, May 20, 2018 at 11:40:45PM +, Bart Van Assche wrote: > > > On Sun, 2018-05-20 at 19:21 -0400, Kent Overstreet wrote: > > > > I really have better things to do than debug someone else's tests... > > > > [ ... ] > > > > ../run_tests: line 65: cd: /lib/modules/4.16.0+/kernel/block: No such > > > > file or directory > > > > > > Kernel v4.16 is too old to run these tests. The srp-test script needs the > > > following commit that went upstream in kernel v4.17-rc1: > > > > > > 63cf1a902c9d ("IB/srpt: Add RDMA/CM support") > > > > Same output on Jens' for-next branch. > > Others have been able to run the srp-test software with the instructions > provided earlier in this e-mail thread. Can you share the kernel messages from > around the time the test was run (dmesg, /var/log/messages or > /var/log/syslog)? > > Thanks, > > Bart. Bart, Have you made any progress in porting srp-test to blktests so we don't have to have this conversation again?
Re: [PATCH RESEND] loop: clear wb_err in bd_inode when detaching backing file
On 5/21/18 12:35 PM, Jeff Layton wrote: > From: Jeff Layton > > When a loop block device encounters a writeback error, that error will > get propagated to the bd_inode's wb_err field. If we then detach the > backing file from it, attach another and fsync it, we'll get back the > writeback error that we had from the previous backing file. > > This is a bit of a grey area as POSIX doesn't cover loop devices, but it > is somewhat counterintuitive. > > If we detach a backing file from the loopdev while there are still > unreported errors, take it as a sign that we're no longer interested in > the previous file, and clear out the wb_err in the loop blockdev. Applied, thanks Jeff. -- Jens Axboe
[PATCH RESEND] loop: clear wb_err in bd_inode when detaching backing file
From: Jeff Layton  When a loop block device encounters a writeback error, that error will get propagated to the bd_inode's wb_err field. If we then detach the backing file from it, attach another and fsync it, we'll get back the writeback error that we had from the previous backing file. This is a bit of a grey area as POSIX doesn't cover loop devices, but it is somewhat counterintuitive. If we detach a backing file from the loopdev while there are still unreported errors, take it as a sign that we're no longer interested in the previous file, and clear out the wb_err in the loop blockdev. Reported-and-Tested-by: Theodore Y. Ts'o Signed-off-by: Jeff Layton --- drivers/block/loop.c | 1 + 1 file changed, 1 insertion(+) Resend to fix the attributions. diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 5d4e31655d96..55cf554bc914 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -1068,6 +1068,7 @@ static int loop_clr_fd(struct loop_device *lo) if (bdev) { bdput(bdev); invalidate_bdev(bdev); + bdev->bd_inode->i_mapping->wb_err = 0; } set_capacity(lo->lo_disk, 0); loop_sysfs_exit(lo); -- 2.17.0
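The fix leans on the errseq_t semantics underneath wb_err: a recorded error stays visible to later checkers until it has been reported. A rough sketch of those semantics (illustrative only; mapping_err stands in for bdev->bd_inode->i_mapping->wb_err):

#include <linux/errno.h>
#include <linux/errseq.h>

static int demo_stale_error(void)
{
	errseq_t mapping_err = 0;
	errseq_t since = errseq_sample(&mapping_err);

	/* The old backing file hits a writeback error before detach: */
	errseq_set(&mapping_err, -EIO);

	/*
	 * After detaching and attaching a new backing file, an fsync()
	 * that checks the mapping still sees the old, never-reported
	 * error -- which is why loop_clr_fd() now resets wb_err to 0.
	 */
	return errseq_check(&mapping_err, since);	/* returns -EIO */
}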
Re: [PATCH] loop: clear wb_err in bd_inode when detaching backing file
On 5/21/18 12:26 PM, Jeff Layton wrote: > On Mon, 2018-05-21 at 08:37 -0400, Jeff Layton wrote: >> From: Jeff Layton >> >> When a loop block device encounters a writeback error, that error will >> get propagated to the bd_inode's wb_err field. If we then detach the >> backing file from it, attach another and fsync it, we'll get back the >> writeback error that we had from the previous backing file. >> >> This is a bit of a grey area as POSIX doesn't cover loop devices, but it >> is somewhat counterintuitive. >> >> If we detach a backing file from the loopdev while there are still >> unreported errors, take it as a sign that we're no longer interested in >> the previous file, and clear out the wb_err in the loop blockdev. >> >> Signed-off-by: Jeff Layton >> --- >> drivers/block/loop.c | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/drivers/block/loop.c b/drivers/block/loop.c >> index 5d4e31655d96..55cf554bc914 100644 >> --- a/drivers/block/loop.c >> +++ b/drivers/block/loop.c >> @@ -1068,6 +1068,7 @@ static int loop_clr_fd(struct loop_device *lo) >> if (bdev) { >> bdput(bdev); >> invalidate_bdev(bdev); >> + bdev->bd_inode->i_mapping->wb_err = 0; >> } >> set_capacity(lo->lo_disk, 0); >> loop_sysfs_exit(lo); > > Jens, > > I sent this to the mailing lists earlier, but it occurs to me that > loop.c changes should probably go through the block tree. Would you > mind picking this up and feeding it to next and on to Linus? > > I think we probably want this in v4.17 if possible, as it is a > regression from v4.16 (due to commit b4678df184b3) and was causing some > failures in xfstests runs. > > Hmm, now that I think about it, we should probably also give Ted > "Reported-and-Tested-by" credit here too. Let me know if you'd rather I > resend with that added. Yeah, I can funnel it for 4.17 - a quick resend with the right attributions would be appreciated, thanks. -- Jens Axboe
Re: [PATCH blktests] Documentation: document prerequisite scriptlets
On Mon, May 14, 2018 at 01:40:42PM +0200, Johannes Thumshirn wrote: > The config file is bash and it gets sourced, so all bash magic is > doable in there as well. Document it so others don't have to > re-discover this gem as well. I'm supportive of this... > Signed-off-by: Johannes Thumshirn > --- > Documentation/running-tests.md | 12 > 1 file changed, 12 insertions(+) > > diff --git a/Documentation/running-tests.md b/Documentation/running-tests.md > index a479d5e94c5e..b477c0679683 100644 > --- a/Documentation/running-tests.md > +++ b/Documentation/running-tests.md > @@ -66,3 +66,15 @@ command line option. > QUICK_RUN=1 > TIMEOUT=30 > ``` > + > +### Pre-test setups > + > +Some tests may need special prerequisites, like configfs being > +mounted for NVMe over Fabrics tests. You can add your custom bash > +scriptlets to `config` to get this done, e.g.: > + > +```sh > +if ! test $(grep -q configfs /proc/mounts) ; then > +mount -t configfs none /sys/kernel/config > +fi > +``` But I'm curious about this specific example. Is this not mounted for you automatically? I'm guessing systemd does it for me on my setup.
Re: [PATCH] loop: clear wb_err in bd_inode when detaching backing file
On Mon, 2018-05-21 at 08:37 -0400, Jeff Layton wrote: > From: Jeff Layton > > When a loop block device encounters a writeback error, that error will > get propagated to the bd_inode's wb_err field. If we then detach the > backing file from it, attach another and fsync it, we'll get back the > writeback error that we had from the previous backing file. > > This is a bit of a grey area as POSIX doesn't cover loop devices, but it > is somewhat counterintuitive. > > If we detach a backing file from the loopdev while there are still > unreported errors, take it as a sign that we're no longer interested in > the previous file, and clear out the wb_err in the loop blockdev. > > Signed-off-by: Jeff Layton > --- > drivers/block/loop.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > index 5d4e31655d96..55cf554bc914 100644 > --- a/drivers/block/loop.c > +++ b/drivers/block/loop.c > @@ -1068,6 +1068,7 @@ static int loop_clr_fd(struct loop_device *lo) > if (bdev) { > bdput(bdev); > invalidate_bdev(bdev); > + bdev->bd_inode->i_mapping->wb_err = 0; > } > set_capacity(lo->lo_disk, 0); > loop_sysfs_exit(lo); Jens, I sent this to the mailing lists earlier, but it occurs to me that loop.c changes should probably go through the block tree. Would you mind picking this up and feeding it to next and on to Linus? I think we probably want this in v4.17 if possible, as it is a regression from v4.16 (due to commit b4678df184b3) and was causing some failures in xfstests runs. Hmm, now that I think about it, we should probably also give Ted "Reported-and-Tested-by" credit here too. Let me know if you'd rather I resend with that added. Thanks, -- Jeff Layton
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21 2018 at 1:37pm -0400, Kent Overstreet wrote: > > Uh, you came across as upset and paranoid to me too. Chalk it up to email? :) Awesome. See how easy it is to make someone with purely constructive questions and feedback come off as upset and paranoid? The tipping point occurs when bait is set with: "It's not like ". Then: "Let's focus on getting it reviewed, rather than pontificate on what could potentially go all wrong with this." Message received: less pontificating, more doing! And here I thought that focusing on what could go wrong (across all code touched) was review. But OK, I'm the one that made this all weird ;) It is what it is at this point. > I personally don't care, I have no horse in this race. This particular patch > series wasn't my idea, Christoph wanted all these conversions done so > bioset_create() could be deleted. If you want us to hold off on the dm patch for > awhile until someone can get around to testing it or whatever (since I don't > have tests for everything I pushed) that would be perfectly fine by me. As I clarified already: this isn't about DM. Every single data structure change in this series should be reviewed for unforeseen alignment consequences. Jens seemed to say that is worthwhile. Not sure if he'll do it or we divide it up. If we divide it up a temp topic branch should be published for others to inspect. Could be alignment hasn't been a historic concern for a bunch of the data structures changed in this series.. if so then all we can do is fix up any obvious potential for false sharing. Mike
[PATCH] block: Verify whether blk_queue_enter() is used when necessary
It is required to protect blkg_lookup() calls with a blk_queue_enter() / blk_queue_exit() pair. Since it is nontrivial to verify whether this is the case, verify this at runtime. Only perform this verification if CONFIG_LOCKDEP=y to avoid that unnecessary runtime overhead is added. Note: using lockdep to verify whether blkg_lookup() is protected correctly is not possible since lock_acquire() and lock_release() must be called from the same task and since blk_queue_enter() and blk_queue_exit() can be called from different tasks. Suggested-by: Tejun HeoSigned-off-by: Bart Van Assche Cc: Tejun Heo --- block/blk-cgroup.c | 2 ++ block/blk-core.c| 24 include/linux/blk-cgroup.h | 2 ++ include/linux/blkdev.h | 11 +++ include/linux/percpu-refcount.h | 2 ++ lib/percpu-refcount.c | 25 + 6 files changed, 66 insertions(+) diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index eb85cb87c40f..78822dcfa0da 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -145,6 +145,8 @@ struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg, { struct blkcg_gq *blkg; + WARN_ON_ONCE(!blk_entered_queue(q)); + /* * Hint didn't match. Look up from the radix tree. Note that the * hint can only be updated under queue_lock as otherwise @blkg diff --git a/block/blk-core.c b/block/blk-core.c index 8b9e5dc882f4..b6fa6a9f7daa 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -687,6 +687,9 @@ EXPORT_SYMBOL_GPL(blk_queue_bypass_end); void blk_set_queue_dying(struct request_queue *q) { +#ifdef CONFIG_PROVE_LOCKING + q->cleanup_queue_task = current; +#endif blk_queue_flag_set(QUEUE_FLAG_DYING, q); /* @@ -907,6 +910,25 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask) } EXPORT_SYMBOL(blk_alloc_queue); +#ifdef CONFIG_PROVE_LOCKING +/** + * blk_entered_queue() - whether or not it is safe to access cgroup information + * @q: request queue pointer + * + * In order to avoid races between accessing cgroup information and the cgroup + * information removal from inside blk_cleanup_queue(), any code that accesses + * cgroup information must either be protected by blk_queue_enter() and/or + * blk_queue_enter_live() or must be called after the queue has been marked + * dying from the same task that called blk_cleanup_queue(). + */ +bool blk_entered_queue(struct request_queue *q) +{ + return (blk_queue_dying(q) && current == q->cleanup_queue_task) || + percpu_ref_read(>q_usage_counter) > 0; +} +EXPORT_SYMBOL(blk_entered_queue); +#endif + /** * blk_queue_enter() - try to increase q->q_usage_counter * @q: request queue pointer @@ -2254,6 +2276,8 @@ generic_make_request_checks(struct bio *bio) goto end_io; } + WARN_ON_ONCE(!blk_entered_queue(q)); + /* * For a REQ_NOWAIT based request, return -EOPNOTSUPP * if queue is not a request based queue. 
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h index 6c666fd7de3c..3b8512c259aa 100644 --- a/include/linux/blk-cgroup.h +++ b/include/linux/blk-cgroup.h @@ -266,6 +266,8 @@ static inline struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg, { struct blkcg_gq *blkg; + WARN_ON_ONCE(!blk_entered_queue(q)); + if (blkcg == _root) return q->root_blkg; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 780e4ea80d4d..0ed23677c36f 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -649,6 +649,9 @@ struct request_queue { int bypass_depth; atomic_tmq_freeze_depth; +#ifdef CONFIG_PROVE_LOCKING + struct task_struct *cleanup_queue_task; +#endif #if defined(CONFIG_BLK_DEV_BSG) bsg_job_fn *bsg_job_fn; @@ -1000,6 +1003,14 @@ extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t, extern int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags); extern void blk_queue_exit(struct request_queue *q); +#ifdef CONFIG_PROVE_LOCKING +extern bool blk_entered_queue(struct request_queue *q); +#else +static inline bool blk_entered_queue(struct request_queue *q) +{ + return true; +} +#endif extern void blk_start_queue(struct request_queue *q); extern void blk_start_queue_async(struct request_queue *q); extern void blk_stop_queue(struct request_queue *q); diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h index 009cdf3d65b6..5707289ba828 100644 --- a/include/linux/percpu-refcount.h +++ b/include/linux/percpu-refcount.h @@ -331,4 +331,6 @@ static inline bool percpu_ref_is_zero(struct percpu_ref *ref) return !atomic_long_read(>count); } +unsigned long percpu_ref_read(struct percpu_ref *ref); + #endif diff --git
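For callers, the contract the new WARN_ON_ONCE() enforces looks like the following. A hedged sketch (inspect_blkg is illustrative; blkg_lookup(), blk_queue_enter() and blk_queue_exit() are the real interfaces):

#include <linux/blk-cgroup.h>
#include <linux/blkdev.h>
#include <linux/rcupdate.h>

/* Any blkg_lookup() must sit inside a blk_queue_enter()/blk_queue_exit()
 * pair, or run from the task that marked the queue dying. */
static void inspect_blkg(struct blkcg *blkcg, struct request_queue *q)
{
	struct blkcg_gq *blkg;

	if (blk_queue_enter(q, 0))
		return;			/* queue is dying */

	rcu_read_lock();
	blkg = blkg_lookup(blkcg, q);	/* blk_entered_queue(q) holds here */
	if (blkg) {
		/* ... inspect blkg under RCU ... */
	}
	rcu_read_unlock();

	blk_queue_exit(q);
}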
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21, 2018 at 12:09:14PM -0400, Mike Snitzer wrote: > On Mon, May 21 2018 at 11:36am -0400, > Jens Axboewrote: > > > On 5/21/18 9:18 AM, Mike Snitzer wrote: > > > On Mon, May 21 2018 at 11:09am -0400, > > > Jens Axboe wrote: > > > > > >> On 5/21/18 9:04 AM, Mike Snitzer wrote: > > >>> On Mon, May 21 2018 at 10:52am -0400, > > >>> Jens Axboe wrote: > > >>> > ... > > IMHO you're making a big deal out of something that should not be. > > >>> > > >>> I raised an issue that had seemingly not been considered at all. Not > > >>> making a big deal. Raising it for others' benefit. > > >>> > > If the dm bits are that sensitive and cache line honed to perfection > > already due to previous regressions in that area, then it might > > not be a bad idea to have some compile checks for false cacheline > > sharing between sensitive members, or spilling of a sub-struct > > into multiple cachelines. > > > > It's not like this was pushed behind your back. It's posted for > > review. It's quite possible the net change is a win for dm. Let's > > focus on getting it reviewed, rather than pontificate on what > > could potentially go all wrong with this. > > >>> > > >>> Why are you making this personal? Or purely about DM? I'm merely > > >>> pointing out this change isn't something that can be given a quick > > >>> blanket "looks good". > > >> > > >> I'm not making this personal at all?! You raised a (valid) concern, > > >> I'm merely stating why I don't think it's a high risk issue. I'm > > >> assuming your worry is related to dm, as those are the reports > > >> that would ultimately land on your desk. > > > > > > Then we'll just agree to disagree with what this implies: "It's not like > > > this was pushed behind your back." > > > > I'm afraid you've lost me now - it was not pushed behind your back, > > it was posted for review, with you on the CC list. Not trying to > > be deliberately dense here, I just don't see what our disagreement is. > > You're having an off day ;) Mondays and all? > > I just raised an alignment concern that needs to be considered during > review by all stakeholders. Wasn't upset at all. Maybe my email tone > came off otherwise? > > And then you got confused by me pointing out how it was weird for you to > suggest I felt this stuff was pushed behind my back.. and went on to > think I really think that. It's almost like you're a confused hypnotist > seeding me with paranoid dilutions. ;) Uh, you came across as upset and paranoid to me too. Chalk it up to email? :) I personally don't care, I have no horse in this race. This particular patch series wasn't my idea, Christoph wanted all these conversions done so bioset_create() could be deleted. If you want us to hold off on the dm patch for awhile until someone can get around to testing it or whatever (since I don't have tests for everything I pushed) that would be perfectly fine by me.
[PATCH v12 2/2] blk-mq: Rework blk-mq timeout handling again
Recently the blk-mq timeout handling code was reworked. See also Tejun Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html). This patch reworks the blk-mq timeout handling code again. The timeout handling code is simplified by introducing a state machine per request. This change avoids that the blk-mq timeout handling code ignores completions that occur after blk_mq_check_expired() has been called and before blk_mq_rq_timed_out() has been called. Fix this race as follows: - Replace the __deadline member of struct request by a new member called das that contains the generation number, state and deadline. Only 32 bits are used for the deadline field such that all three fields occupy only 64 bits. This change reduces the maximum supported request timeout value from (2**63/HZ) to (2**31/HZ). - Remove all request member variables that became superfluous due to this change: gstate, gstate_seq and aborted_gstate_sync. - Remove the request state information that became superfluous due to this patch, namely RQF_MQ_TIMEOUT_EXPIRED. - Remove the code that became superfluous due to this change, namely the RCU lock and unlock statements in blk_mq_complete_request() and also the synchronize_rcu() call in the timeout handler. Notes: - A spinlock is used to protect atomic changes of rq->das on those architectures that do not provide a cmpxchg64() implementation. - Atomic instructions are only used to update the request state if a concurrent request state change could be in progress. - blk_add_timer() has been split into two functions - one for the legacy block layer and one for blk-mq. Signed-off-by: Bart Van AsscheCc: Tejun Heo Cc: Christoph Hellwig Cc: Jianchao Wang Cc: Ming Lei Cc: Sebastian Ott Cc: Sagi Grimberg Cc: Israel Rukshin , Cc: Max Gurtovoy --- block/blk-core.c | 6 -- block/blk-mq-debugfs.c | 1 - block/blk-mq.c | 238 ++--- block/blk-mq.h | 64 ++--- block/blk-timeout.c| 133 ++- block/blk.h| 11 +-- include/linux/blkdev.h | 48 +- 7 files changed, 263 insertions(+), 238 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 341501c5e239..a7301fcda734 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq) rq->internal_tag = -1; rq->start_time_ns = ktime_get_ns(); rq->part = NULL; - seqcount_init(>gstate_seq); - u64_stats_init(>aborted_gstate_sync); - /* -* See comment of blk_mq_init_request -*/ - WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC); } EXPORT_SYMBOL(blk_rq_init); diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index 3080e18cb859..ffa622366922 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -344,7 +344,6 @@ static const char *const rqf_name[] = { RQF_NAME(STATS), RQF_NAME(SPECIAL_PAYLOAD), RQF_NAME(ZONE_WRITE_LOCKED), - RQF_NAME(MQ_TIMEOUT_EXPIRED), RQF_NAME(MQ_POLL_SLEPT), }; #undef RQF_NAME diff --git a/block/blk-mq.c b/block/blk-mq.c index 64630caaf27e..af9375ffa4b1 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -318,7 +318,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, rq->special = NULL; /* tag was already set */ rq->extra_len = 0; - rq->__deadline = 0; + WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE); INIT_LIST_HEAD(>timeout_list); rq->timeout = 0; @@ -465,6 +465,39 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, } EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx); +/** + * blk_mq_change_rq_state - atomically test and set 
request state + * @rq: Request pointer. + * @old_state: Old request state. + * @new_state: New request state. + * + * Returns %true if and only if the old state was @old and if the state has + * been changed into @new. + */ +static bool blk_mq_change_rq_state(struct request *rq, + enum mq_rq_state old_state, + enum mq_rq_state new_state) +{ + union blk_deadline_and_state das = READ_ONCE(rq->das); + union blk_deadline_and_state old_val = das; + union blk_deadline_and_state new_val = das; + + WARN_ON_ONCE(new_state == MQ_RQ_IN_FLIGHT); + + old_val.state = old_state; + new_val.state = new_state; + /* +* For transitions from state in-flight to another state cmpxchg64() +* must be used. For other state transitions it is safe to use +* WRITE_ONCE(). +*/ + if (old_state
[PATCH v12 1/2] ARM: cmpxchg64: Only define cmpxchg64() if not using the Thumb instruction set
Since the implementation of cmpxchg64() uses instructions that are not Thumb instructions, only define cmpxchg64() if not building in Thumb mode. This patch allows the next patch in this series to check for cmpxchg64() support using #if defined(cmpxchg64). Signed-off-by: Bart Van Assche Cc: Catalin Marinas Cc: Will Deacon Cc: Russell King --- arch/arm/include/asm/cmpxchg.h | 4 1 file changed, 4 insertions(+) diff --git a/arch/arm/include/asm/cmpxchg.h b/arch/arm/include/asm/cmpxchg.h index 8b701f8e175c..41ab99bc04ca 100644 --- a/arch/arm/include/asm/cmpxchg.h +++ b/arch/arm/include/asm/cmpxchg.h @@ -141,7 +141,9 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size sizeof(*(ptr)));\ }) +#ifndef CONFIG_THUMB2_KERNEL #define cmpxchg64_local(ptr, o, n) __cmpxchg64_local_generic((ptr), (o), (n)) +#endif /* CONFIG_THUMB2_KERNEL */ #include @@ -241,6 +243,7 @@ static inline unsigned long __cmpxchg_local(volatile void *ptr, sizeof(*(ptr)));\ }) +#ifndef CONFIG_THUMB2_KERNEL static inline unsigned long long __cmpxchg64(unsigned long long *ptr, unsigned long long old, unsigned long long new) @@ -273,6 +276,7 @@ static inline unsigned long long __cmpxchg64(unsigned long long *ptr, }) #define cmpxchg64_local(ptr, o, n) cmpxchg64_relaxed((ptr), (o), (n)) +#endif /* CONFIG_THUMB2_KERNEL */ #endif /* __LINUX_ARM_ARCH__ >= 6 */ -- 2.16.3
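The consumer pattern this enables is a plain preprocessor test, with the spinlock fallback the blk-mq patch provides as das_lock. A sketch under those assumptions (change_packed_word is an illustrative name):

#include <linux/spinlock.h>
#include <linux/types.h>

/* Atomically swing a packed 64-bit word, falling back to a spinlock
 * where cmpxchg64() is unavailable (e.g. a Thumb-2 kernel). */
static bool change_packed_word(u64 *word, u64 old_val, u64 new_val,
			       spinlock_t *das_lock)
{
#if defined(cmpxchg64)
	return cmpxchg64(word, old_val, new_val) == old_val;
#else
	unsigned long flags;
	bool done = false;

	spin_lock_irqsave(das_lock, flags);
	if (*word == old_val) {
		*word = new_val;
		done = true;
	}
	spin_unlock_irqrestore(das_lock, flags);
	return done;
#endif
}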
[PATCH v12 0/2] blk-mq: Rework blk-mq timeout handling again
Hello Jens, This patch series reworks blk-mq timeout handling by introducing a state machine per request. Please consider this patch series for inclusion in the upstream kernel. Bart. Changes compared to v11: - Reworked patch 1/2: instead of introducing CONFIG_ARCH_HAVE_CMPXCHG64, make sure that cmpxchg64() is only defined if it can be used. Changes compared to v10: - In patch 1/2, added "default y if 64BIT" to the "config ARCH_HAVE_CMPXCHG64" entry in arch/Kconfig. Left out the "select ARCH_HAVE_CMPXCHG64" statements that became superfluous due to this change (alpha, arm64, powerpc and s390). - Also in patch 1/2, only select ARCH_HAVE_CMPXCHG64 if X86_CMPXCHG64 has been selected. - In patch 2/2, moved blk_mq_change_rq_state() from blk-mq.h to blk-mq.c. - Added a comment header above __blk_mq_requeue_request() and blk_mq_requeue_request(). - Documented the MQ_RQ_* state transitions in block/blk-mq.h. - Left out the fourth argument of blk_mq_rq_set_deadline(). Changes compared to v9: - Addressed multiple comments related to patch 1/2: added CONFIG_ARCH_HAVE_CMPXCHG64 for riscv, modified features/locking/cmpxchg64/arch-support.txt as requested and made the order of the symbols in the arch/*/Kconfig alphabetical where possible. Changes compared to v8: - Split into two patches. - Moved the spin_lock_init() call from blk_mq_rq_ctx_init() into blk_mq_init_request(). - Fixed the deadline set by blk_add_timer(). - Surrounded the das_lock member with #ifndef CONFIG_ARCH_HAVE_CMPXCHG64 / #endif. Changes compared to v7: - Fixed the generation number mechanism. Note: with this patch applied the behavior of the block layer does not depend on the generation number. - Added more 32-bit architectures to the list of architectures on which cmpxchg64() should not be used. Changes compared to v6: - Used a union instead of bit manipulations to store multiple values into a single 64-bit field. - Reduced the size of the timeout field from 64 to 32 bits. - Made sure that the block layer still builds with this patch applied for the sh and mips architectures. - Fixed two sparse warnings that were introduced by this patch in the WRITE_ONCE() calls. Changes compared to v5: - Restored the synchronize_rcu() call between marking a request for timeout handling and the actual timeout handling to avoid that timeout handling starts while .queue_rq() is still in progress if the timeout is very short. - Only use cmpxchg() if another context could attempt to change the request state concurrently. Use WRITE_ONCE() otherwise. Changes compared to v4: - Addressed multiple review comments from Christoph. The most important are that atomic_long_cmpxchg() has been changed into cmpxchg() and also that there is now a nice and clean split between the legacy and blk-mq versions of blk_add_timer(). - Changed the patch name and modified the patch description because there is disagreement about whether or not the v4.16 blk-mq core can complete a single request twice. Kept the "Cc: stable" tag because of https://bugzilla.kernel.org/show_bug.cgi?id=199077. Changes compared to v3 (see also https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html): - Removed the spinlock again that was introduced to protect the request state. v4 uses atomic_long_cmpxchg() instead. - Split __deadline into two variables - one for the legacy block layer and one for blk-mq. Changes compared to v2 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html): - Rebased and retested on top of kernel v4.16. 
Changes compared to v1 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html): - Removed the gstate and aborted_gstate members of struct request and used the __deadline member to encode both the generation and state information. Bart Van Assche (2): ARM: cmpxchg64: Only define cmpxchg64() if not using the Thumb instruction set blk-mq: Rework blk-mq timeout handling again arch/arm/include/asm/cmpxchg.h | 4 + block/blk-core.c | 6 -- block/blk-mq-debugfs.c | 1 - block/blk-mq.c | 238 +++-- block/blk-mq.h | 64 +-- block/blk-timeout.c| 133 +++ block/blk.h| 11 +- include/linux/blkdev.h | 48 + 8 files changed, 267 insertions(+), 238 deletions(-) -- 2.16.3
Re: [PATCH 1/6] nvme: Sync request queues on reset
On Tue, May 22, 2018 at 12:08:37AM +0800, Ming Lei wrote: > Please take a look at blk_mq_complete_request(). Even with Bart's > change, the request still won't be completed by driver. The request can > only be completed by either driver or blk-mq, not both. So you're saying blk-mq can't complete a request the driver returned to blk-mq to complete. And that's the nvme driver's problem to fix?
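The invariant behind this exchange can be stated compactly: a request is completed exactly once, so the normal completion path and the timeout path must both try to win the same state transition, and only one can. A conceptual sketch, not the blk-mq implementation:

#include <linux/atomic.h>

enum demo_rq_state { DEMO_RQ_IN_FLIGHT, DEMO_RQ_COMPLETE };

/* Only the caller that observes IN_FLIGHT may complete the request;
 * the loser of the cmpxchg must back off. */
static bool demo_claim_completion(atomic_t *state)
{
	return atomic_cmpxchg(state, DEMO_RQ_IN_FLIGHT,
			      DEMO_RQ_COMPLETE) == DEMO_RQ_IN_FLIGHT;
}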
Re: [PATCH 3/6] nvme: Move all IO out of controller reset
On Tue, May 22, 2018 at 12:04:53AM +0800, Ming Lei wrote: > On Mon, May 21, 2018 at 09:44:33AM -0600, Keith Busch wrote: > > On Mon, May 21, 2018 at 11:34:27PM +0800, Ming Lei wrote: > > > nvme_dev_disable() quiesces queues first before killing queues. > > > > > > If queues are quiesced during or before nvme_wait_freeze() is run > > > from the 2nd part of reset, the 2nd part can't move on, and IO hang > > > is caused. Finally no reset can be scheduled at all. > > > > But this patch moves nvme_wait_freeze outside the reset path, so I'm > > afraid I'm unable to follow how you've concluded the wait freeze is > > somehow part of the reset. > > For example: > > 1) the 1st timeout event: > > - nvme_dev_disable() > - reset > - scan_work > > 2) the 2nd timeout event: > > nvme_dev_disable() may come just after nvme_start_queues() in > the above reset of the 1st timeout. And nvme_timeout() won't > schedule a new reset since the controller state is NVME_CTRL_CONNECTING. Let me get this straight -- you're saying nvme_start_queues is going to somehow immediately trigger timeout work? I can't see how that could possibly happen in real life, but we can just remove it and use the existing nvme_start_ctrl to handle that in the LIVE state.
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On 5/21/18 10:09 AM, Mike Snitzer wrote: > On Mon, May 21 2018 at 11:36am -0400, > Jens Axboewrote: > >> On 5/21/18 9:18 AM, Mike Snitzer wrote: >>> On Mon, May 21 2018 at 11:09am -0400, >>> Jens Axboe wrote: >>> On 5/21/18 9:04 AM, Mike Snitzer wrote: > On Mon, May 21 2018 at 10:52am -0400, > Jens Axboe wrote: > > ... >> IMHO you're making a big deal out of something that should not be. > > I raised an issue that had seemingly not been considered at all. Not > making a big deal. Raising it for others' benefit. > >> If the dm bits are that sensitive and cache line honed to perfection >> already due to previous regressions in that area, then it might >> not be a bad idea to have some compile checks for false cacheline >> sharing between sensitive members, or spilling of a sub-struct >> into multiple cachelines. >> >> It's not like this was pushed behind your back. It's posted for >> review. It's quite possible the net change is a win for dm. Let's >> focus on getting it reviewed, rather than pontificate on what >> could potentially go all wrong with this. > > Why are you making this personal? Or purely about DM? I'm merely > pointing out this change isn't something that can be given a quick > blanket "looks good". I'm not making this personal at all?! You raised a (valid) concern, I'm merely stating why I don't think it's a high risk issue. I'm assuming your worry is related to dm, as those are the reports that would ultimately land on your desk. >>> >>> Then we'll just agree to disagree with what this implies: "It's not like >>> this was pushed behind your back." >> >> I'm afraid you've lost me now - it was not pushed behind your back, >> it was posted for review, with you on the CC list. Not trying to >> be deliberately dense here, I just don't see what our disagreement is. > > You're having an off day ;) Mondays and all? > > I just raised an alignment concern that needs to be considered during > review by all stakeholders. Wasn't upset at all. Maybe my email tone > came off otherwise? > > And then you got confused by me pointing out how it was weird for you to > suggest I felt this stuff was pushed behind my back.. and went on to > think I really think that. It's almost like you're a confused hypnotist > seeding me with paranoid dilutions. ;) > > /me waits for Jens to snap his fingers so he can just slowly step away > from this line of discussion that is solidly dead now... Mike, wtf are you talking about. >>> Reality is I'm fine with the change. Just think there is follow-on work >>> (now or later) that is needed. >> >> It's not hard to run the quick struct layout checks or alignment. If >> there's a concern, that should be done now, instead of being deferred to >> later. There's no point merging something that we expect to have >> follow-on work. If that's the case, then it should not be merged. There >> are no dependencies in the patchset, except that the last patch >> obviously can't be applied until all of the previous ones are in. > > Cool, sounds good. > >> I took a quick look at the struct mapped_device layout, which I'm >> assuming is the most important on the dm side. It's pretty big to >> begin with, obviously this makes it bigger since we're now >> embedding the bio_sets. On my config, doesn't show any false sharing >> that would be problematic, or layout changes that would worry me. FWIW. > > Great, thanks, do you happen to have a tree you can push so others can > get a quick compile and look at the series fully applied? 
I do not, I simply ran a quick analysis of the layout, then applied the full patchset, and repeated that exercise. Literally a 5 minute thing. I haven't applied the series, so haven't pushed anything out. > BTW, I'm upstream DM maintainer but I have a role in caring for related > subsystems (e.g. block core, etc) in downstream releases.. *cough* RHEL. > So this alignment concern wasn't ever purely about DM. I tend to care about that too... -- Jens Axboe
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21 2018 at 11:36am -0400, Jens Axboewrote: > On 5/21/18 9:18 AM, Mike Snitzer wrote: > > On Mon, May 21 2018 at 11:09am -0400, > > Jens Axboe wrote: > > > >> On 5/21/18 9:04 AM, Mike Snitzer wrote: > >>> On Mon, May 21 2018 at 10:52am -0400, > >>> Jens Axboe wrote: > >>> ... > IMHO you're making a big deal out of something that should not be. > >>> > >>> I raised an issue that had seemingly not been considered at all. Not > >>> making a big deal. Raising it for others' benefit. > >>> > If the dm bits are that sensitive and cache line honed to perfection > already due to previous regressions in that area, then it might > not be a bad idea to have some compile checks for false cacheline > sharing between sensitive members, or spilling of a sub-struct > into multiple cachelines. > > It's not like this was pushed behind your back. It's posted for > review. It's quite possible the net change is a win for dm. Let's > focus on getting it reviewed, rather than pontificate on what > could potentially go all wrong with this. > >>> > >>> Why are you making this personal? Or purely about DM? I'm merely > >>> pointing out this change isn't something that can be given a quick > >>> blanket "looks good". > >> > >> I'm not making this personal at all?! You raised a (valid) concern, > >> I'm merely stating why I don't think it's a high risk issue. I'm > >> assuming your worry is related to dm, as those are the reports > >> that would ultimately land on your desk. > > > > Then we'll just agree to disagree with what this implies: "It's not like > > this was pushed behind your back." > > I'm afraid you've lost me now - it was not pushed behind your back, > it was posted for review, with you on the CC list. Not trying to > be deliberately dense here, I just don't see what our disagreement is. You're having an off day ;) Mondays and all? I just raised an alignment concern that needs to be considered during review by all stakeholders. Wasn't upset at all. Maybe my email tone came off otherwise? And then you got confused by me pointing out how it was weird for you to suggest I felt this stuff was pushed behind my back.. and went on to think I really think that. It's almost like you're a confused hypnotist seeding me with paranoid dilutions. ;) /me waits for Jens to snap his fingers so he can just slowly step away from this line of discussion that is solidly dead now... > > Reality is I'm fine with the change. Just think there is follow-on work > > (now or later) that is needed. > > It's not hard to run the quick struct layout checks or alignment. If > there's a concern, that should be done now, instead of being deferred to > later. There's no point merging something that we expect to have > follow-on work. If that's the case, then it should not be merged. There > are no dependencies in the patchset, except that the last patch > obviously can't be applied until all of the previous ones are in. Cool, sounds good. > I took a quick look at the struct mapped_device layout, which I'm > assuming is the most important on the dm side. It's pretty big to > begin with, obviously this makes it bigger since we're now > embedding the bio_sets. On my config, doesn't show any false sharing > that would be problematic, or layout changes that would worry me. FWIW. Great, thanks, do you happen to have a tree you can push so others can get a quick compile and look at the series fully applied? BTW, I'm upstream DM maintainer but I have a role in caring for related subsystems (e.g. block core, etc) in downstream releases.. 
*cough* RHEL. So this alignment concern wasn't ever purely about DM. Mike
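The compile checks for false cacheline sharing suggested earlier in this thread are easy to prototype. A hypothetical guard, with made-up struct and member names standing in for whatever a subsystem considers sensitive (BUILD_BUG_ON() needs function context, e.g. a module init):

#include <linux/atomic.h>
#include <linux/build_bug.h>
#include <linux/cache.h>
#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/stddef.h>

/* Fail the build if two hot members land on the same cacheline. */
#define ASSERT_SEPARATE_CACHELINES(type, a, b)				\
	BUILD_BUG_ON(offsetof(type, a) / SMP_CACHE_BYTES ==		\
		     offsetof(type, b) / SMP_CACHE_BYTES)

struct demo_dev {
	spinlock_t lock;				/* hot on submission */
	atomic_t in_flight ____cacheline_aligned;	/* hot on completion */
};

static int __init demo_init(void)
{
	/* Evaluated entirely at compile time; no runtime cost. */
	ASSERT_SEPARATE_CACHELINES(struct demo_dev, lock, in_flight);
	return 0;
}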
Re: [PATCH 1/6] nvme: Sync request queues on reset
On Mon, May 21, 2018 at 09:59:09AM -0600, Keith Busch wrote: > On Mon, May 21, 2018 at 11:25:43PM +0800, Ming Lei wrote: > > On Mon, May 21, 2018 at 08:04:13AM -0600, Keith Busch wrote: > > > On Sat, May 19, 2018 at 08:01:42AM +0800, Ming Lei wrote: > > > > > You keep saying that, but the controller state is global to the > > > > > controller. It doesn't matter which namespace request_queue started > > > > > the > > > > > reset: every namespaces request queue sees the RESETTING controller > > > > > state > > > > > > > > When timeouts come, the global state of RESETTING may not be updated > > > > yet, so all the timeouts may not observe the state. > > > > > > Even prior to the RESETING state, every single command, no matter > > > which namespace or request_queue it came on, is reclaimed by the driver. > > > There *should* be no requests to timeout after nvme_dev_disable is called > > > because the nvme driver returned control of all requests in the tagset > > > to blk-mq. > > > > The timed-out requests won't be canceled by nvme_dev_disable(). > > ??? nvme_dev_disable cancels all started requests. There are no exceptions. Please take a look at blk_mq_complete_request(). Even with Bart's change, the request still won't be completed by driver. The request can only be completed by either driver or blk-mq, not both. > > > If the timed-out requests is handled as RESET_TIMER, there may > > be new timeout event triggered again. > > > > > > > > In any case, if blk-mq decides it won't complete those requests, we > > > can just swap the order in the reset_work: sync first, uncondintionally > > > disable. Does the following snippet look more okay? > > > > > > --- > > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > > > index 17a0190bd88f..42af077ee07a 100644 > > > --- a/drivers/nvme/host/pci.c > > > +++ b/drivers/nvme/host/pci.c > > > @@ -2307,11 +2307,14 @@ static void nvme_reset_work(struct work_struct > > > *work) > > > goto out; > > > > > > /* > > > - * If we're called to reset a live controller first shut it down before > > > - * moving on. > > > + * Ensure there are no timeout work in progress prior to forcefully > > > + * disabling the queue. There is no harm in disabling the device even > > > + * when it was already disabled, as this will forcefully reclaim any > > > + * IOs that are stuck due to blk-mq's timeout handling that prevents > > > + * timed out requests from completing. > > >*/ > > > - if (dev->ctrl.ctrl_config & NVME_CC_ENABLE) > > > - nvme_dev_disable(dev, false); > > > + nvme_sync_queues(>ctrl); > > > + nvme_dev_disable(dev, false); > > > > That may not work reliably too. > > > > For example, request A from NS_0 is timed-out and handled as RESET_TIMER, > > meantime request B from NS_1 is timed-out and handled as EH_HANDLED. > > Meanwhile, request B's nvme_dev_disable prior to returning EH_HANDLED > cancels all requests, which includes request A. No, please see my above comment. Thanks, Ming
Re: [PATCH 3/6] nvme: Move all IO out of controller reset
On Mon, May 21, 2018 at 09:44:33AM -0600, Keith Busch wrote: > On Mon, May 21, 2018 at 11:34:27PM +0800, Ming Lei wrote: > > nvme_dev_disable() quiesces queues first before killing queues. > > > > If queues are quiesced during or before nvme_wait_freeze() is run > > from the 2nd part of reset, the 2nd part can't move on, and IO hang > > is caused. Finally no reset can be scheduled at all. > > But this patch moves nvme_wait_freeze outside the reset path, so I'm > afraid I'm unable to follow how you've concluded the wait freeze is > somehow part of the reset. For example: 1) the 1st timeout event: - nvme_dev_disable() - reset - scan_work 2) the 2nd timeout event: nvme_dev_disable() may come just after nvme_start_queues() in the above reset of the 1st timeout. And nvme_timeout() won't schedule a new reset since the controller state is NVME_CTRL_CONNECTING. Then scan_work in 1st timeout still may hang for ever. Thanks, Ming
[PATCH v5 4/5] fs: blkdev set bio prio from kiocb prio
From: Adam Manzanares  Now that kiocb has an ioprio field, copy this over to the bio when it is created from the kiocb. Signed-off-by: Adam Manzanares --- fs/block_dev.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/block_dev.c b/fs/block_dev.c index 7ec920e27065..da1e94d2bb75 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -355,6 +355,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) bio->bi_write_hint = iocb->ki_hint; bio->bi_private = dio; bio->bi_end_io = blkdev_bio_end_io; + bio->bi_ioprio = iocb->ki_ioprio; ret = bio_iov_iter_get_pages(bio, iter); if (unlikely(ret)) { -- 2.15.1
[PATCH v5 3/5] fs: Add aio iopriority support
From: Adam Manzanares  This is the per-I/O equivalent of the ioprio_set system call. When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field. We set the blkdev bio iopriority unconditionally, so we need to guarantee the kiocb is initialized properly. Added changes to the loopback driver and init_sync_kiocb to achieve this. This patch depends on block: add ioprio_check_cap function. Signed-off-by: Adam Manzanares --- drivers/block/loop.c | 3 +++ fs/aio.c | 16 include/linux/fs.h | 3 +++ include/uapi/linux/aio_abi.h | 1 + 4 files changed, 23 insertions(+) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 5d4e31655d96..dd98dfd97f5e 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -76,6 +76,8 @@ #include #include #include +#include + #include "loop.h" #include @@ -559,6 +561,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, cmd->iocb.ki_filp = file; cmd->iocb.ki_complete = lo_rw_aio_complete; cmd->iocb.ki_flags = IOCB_DIRECT; + cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0); if (cmd->css) kthread_associate_blkcg(cmd->css); diff --git a/fs/aio.c b/fs/aio.c index f3eae5d5771b..44b4572be524 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1451,6 +1451,22 @@ static int aio_prep_rw(struct kiocb *req, struct iocb *iocb) if (iocb->aio_flags & IOCB_FLAG_RESFD) req->ki_flags |= IOCB_EVENTFD; req->ki_hint = file_write_hint(req->ki_filp); + if (iocb->aio_flags & IOCB_FLAG_IOPRIO) { + /* +* If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then +* aio_reqprio is interpreted as an I/O scheduling +* class and priority. +*/ + ret = ioprio_check_cap(iocb->aio_reqprio); + if (ret) { + pr_debug("aio ioprio check cap error\n"); + return -EINVAL; + } + + req->ki_ioprio = iocb->aio_reqprio; + } else + req->ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0); + ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags); if (unlikely(ret)) fput(req->ki_filp); diff --git a/include/linux/fs.h b/include/linux/fs.h index 50de40dbbb85..73b749ed3ea1 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -36,6 +36,7 @@ #include #include #include +#include #include #include @@ -302,6 +303,7 @@ struct kiocb { void *private; int ki_flags; u16 ki_hint; + u16 ki_ioprio; /* See linux/ioprio.h */ } __randomize_layout; static inline bool is_sync_kiocb(struct kiocb *kiocb) @@ -1942,6 +1944,7 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp) .ki_filp = filp, .ki_flags = iocb_flags(filp), .ki_hint = ki_hint_validate(file_write_hint(filp)), + .ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0), }; } diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h index 2c0a3415beee..d4e768d55d14 100644 --- a/include/uapi/linux/aio_abi.h +++ b/include/uapi/linux/aio_abi.h @@ -55,6 +55,7 @@ enum { * is valid. */ #define IOCB_FLAG_RESFD (1 << 0) +#define IOCB_FLAG_IOPRIO (1 << 1) /* read() from /dev/aio returns these structures. */ struct io_event { -- 2.15.1
[PATCH v5 5/5] fs: iomap dio set bio prio from kiocb prio
From: Adam Manzanares  Now that kiocb has an ioprio field, copy this over to the bio when it is created from the kiocb during direct IO. Signed-off-by: Adam Manzanares --- fs/iomap.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/iomap.c b/fs/iomap.c index afd163586aa0..65aae194aeca 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -919,6 +919,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length, bio->bi_iter.bi_sector = (iomap->addr + pos - iomap->offset) >> 9; bio->bi_write_hint = dio->iocb->ki_hint; + bio->bi_ioprio = dio->iocb->ki_ioprio; bio->bi_private = dio; bio->bi_end_io = iomap_dio_bio_end_io; -- 2.15.1
[PATCH v5 1/5] block: add ioprio_check_cap function
From: Adam Manzanares  Aio per command iopriority support introduces a second interface between userland and the kernel capable of passing iopriority. The aio interface also needs the ability to verify that the submitting context has sufficient privileges to submit IOPRIO_RT commands. This patch creates the ioprio_check_cap function to be used by the ioprio_set system call and also by the aio interface. Signed-off-by: Adam Manzanares Reviewed-by: Christoph Hellwig --- block/ioprio.c | 22 -- include/linux/ioprio.h | 2 ++ 2 files changed, 18 insertions(+), 6 deletions(-) diff --git a/block/ioprio.c b/block/ioprio.c index 6f5d0b6625e3..f9821080c92c 100644 --- a/block/ioprio.c +++ b/block/ioprio.c @@ -61,15 +61,10 @@ int set_task_ioprio(struct task_struct *task, int ioprio) } EXPORT_SYMBOL_GPL(set_task_ioprio); -SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio) +int ioprio_check_cap(int ioprio) { int class = IOPRIO_PRIO_CLASS(ioprio); int data = IOPRIO_PRIO_DATA(ioprio); - struct task_struct *p, *g; - struct user_struct *user; - struct pid *pgrp; - kuid_t uid; - int ret; switch (class) { case IOPRIO_CLASS_RT: @@ -92,6 +87,21 @@ SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio) return -EINVAL; } + return 0; +} + +SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio) +{ + struct task_struct *p, *g; + struct user_struct *user; + struct pid *pgrp; + kuid_t uid; + int ret; + + ret = ioprio_check_cap(ioprio); + if (ret) + return ret; + ret = -ESRCH; rcu_read_lock(); switch (which) { diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h index 627efac73e6d..4a28cec49ec3 100644 --- a/include/linux/ioprio.h +++ b/include/linux/ioprio.h @@ -77,4 +77,6 @@ extern int ioprio_best(unsigned short aprio, unsigned short bprio); extern int set_task_ioprio(struct task_struct *task, int ioprio); +extern int ioprio_check_cap(int ioprio); + #endif -- 2.15.1
[PATCH v5 2/5] fs: Convert kiocb rw_hint from enum to u16
From: Adam Manzanares  In order to avoid kiocb bloat for per command iopriority support, rw_hint is converted from enum to a u16. Added a guard around ki_hint assignment. Signed-off-by: Adam Manzanares --- include/linux/fs.h | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 7f07977bdfd7..50de40dbbb85 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -284,6 +284,8 @@ enum rw_hint { WRITE_LIFE_EXTREME = RWH_WRITE_LIFE_EXTREME, }; +#define MAX_KI_HINT ((1 << 16) - 1) /* ki_hint type is u16 */ + #define IOCB_EVENTFD (1 << 0) #define IOCB_APPEND (1 << 1) #define IOCB_DIRECT (1 << 2) @@ -299,7 +301,7 @@ struct kiocb { void (*ki_complete)(struct kiocb *iocb, long ret, long ret2); void *private; int ki_flags; - enum rw_hint ki_hint; + u16 ki_hint; } __randomize_layout; static inline bool is_sync_kiocb(struct kiocb *kiocb) @@ -1927,12 +1929,19 @@ static inline enum rw_hint file_write_hint(struct file *file) static inline int iocb_flags(struct file *file); +static inline u16 ki_hint_validate(enum rw_hint hint) +{ + if (hint > MAX_KI_HINT) + return 0; + return hint; +} + static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp) { *kiocb = (struct kiocb) { .ki_filp = filp, .ki_flags = iocb_flags(filp), - .ki_hint = file_write_hint(filp), + .ki_hint = ki_hint_validate(file_write_hint(filp)), }; } -- 2.15.1
[PATCH v5 0/5] AIO add per-command iopriority
From: Adam Manzanares  This is the per-I/O equivalent of the ioprio_set system call. See the following link for performance implications on a SATA HDD: https://lkml.org/lkml/2016/12/6/495 First patch factors ioprio_check_cap function out of ioprio_set system call to also be used by the aio ioprio interface. Second patch converts kiocb ki_hint field to a u16 to avoid kiocb bloat. Third patch passes ioprio hint from aio iocb to kiocb and initializes kiocb ioprio value appropriately when it is not explicitly set. Fourth patch enables the feature for blkdev. Fifth patch enables the feature for iomap direct IO. Note: this work is based on top of linux-vfs/for-next v2: merge patches use IOCB_FLAG_IOPRIO validate intended use with IOCB_IOPRIO add linux-api and linux-block to cc v3: add ioprio_check_cap function convert kiocb ki_hint to u16 use ioprio_check_cap when adding ioprio to kiocb in aio.c v4: handle IOCB_IOPRIO in aio_prep_rw note patch 3 depends on patch 1 in commit msg v5: rename ki_hint_valid -> ki_hint_validate remove ki_hint_validate comment and whitespace remove IOCB_IOPRIO flag initialize kiocb to have no priority Adam Manzanares (5): block: add ioprio_check_cap function fs: Convert kiocb rw_hint from enum to u16 fs: Add aio iopriority support fs: blkdev set bio prio from kiocb prio fs: iomap dio set bio prio from kiocb prio block/ioprio.c | 22 -- drivers/block/loop.c | 3 +++ fs/aio.c | 16 fs/block_dev.c | 1 + fs/iomap.c | 1 + include/linux/fs.h | 16 ++-- include/linux/ioprio.h | 2 ++ include/uapi/linux/aio_abi.h | 1 + 8 files changed, 54 insertions(+), 8 deletions(-) -- 2.15.1
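For completeness, a hedged userspace sketch of what the series enables: one read submitted with a per-I/O realtime priority. The IOCB_FLAG_IOPRIO value comes from patch 3, the IOPRIO_* constants mirror the kernel's non-UAPI linux/ioprio.h, and error handling is omitted:

#include <linux/aio_abi.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef IOCB_FLAG_IOPRIO
#define IOCB_FLAG_IOPRIO	(1 << 1)	/* from patch 3 of this series */
#endif

/* Mirrors (non-UAPI) include/linux/ioprio.h: */
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_CLASS_RT		1

static int io_setup(unsigned int nr, aio_context_t *ctx)
{
	return syscall(SYS_io_setup, nr, ctx);
}

static int io_submit(aio_context_t ctx, long nr, struct iocb **iocbpp)
{
	return syscall(SYS_io_submit, ctx, nr, iocbpp);
}

int submit_rt_read(int fd, void *buf, size_t len)
{
	aio_context_t ctx = 0;
	struct iocb cb;
	struct iocb *list[1] = { &cb };

	io_setup(1, &ctx);
	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_lio_opcode = IOCB_CMD_PREAD;
	cb.aio_buf = (unsigned long)buf;
	cb.aio_nbytes = len;
	cb.aio_flags = IOCB_FLAG_IOPRIO;
	cb.aio_reqprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0);
	return io_submit(ctx, 1, list);
}

Per patches 1 and 3, the IOPRIO_CLASS_RT request is rejected at submission (the aio path returns -EINVAL) unless the caller has CAP_SYS_ADMIN, matching the ioprio_set() checks.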
Re: [PATCH 1/6] nvme: Sync request queues on reset
On Mon, May 21, 2018 at 11:25:43PM +0800, Ming Lei wrote: > On Mon, May 21, 2018 at 08:04:13AM -0600, Keith Busch wrote: > > On Sat, May 19, 2018 at 08:01:42AM +0800, Ming Lei wrote: > > > > You keep saying that, but the controller state is global to the > > > > controller. It doesn't matter which namespace request_queue started the > > > > reset: every namespaces request queue sees the RESETTING controller > > > > state > > > > > > When timeouts come, the global state of RESETTING may not be updated > > > yet, so all the timeouts may not observe the state. > > > > Even prior to the RESETING state, every single command, no matter > > which namespace or request_queue it came on, is reclaimed by the driver. > > There *should* be no requests to timeout after nvme_dev_disable is called > > because the nvme driver returned control of all requests in the tagset > > to blk-mq. > > The timed-out requests won't be canceled by nvme_dev_disable(). ??? nvme_dev_disable cancels all started requests. There are no exceptions. > If the timed-out requests is handled as RESET_TIMER, there may > be new timeout event triggered again. > > > > > In any case, if blk-mq decides it won't complete those requests, we > > can just swap the order in the reset_work: sync first, uncondintionally > > disable. Does the following snippet look more okay? > > > > --- > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > > index 17a0190bd88f..42af077ee07a 100644 > > --- a/drivers/nvme/host/pci.c > > +++ b/drivers/nvme/host/pci.c > > @@ -2307,11 +2307,14 @@ static void nvme_reset_work(struct work_struct > > *work) > > goto out; > > > > /* > > -* If we're called to reset a live controller first shut it down before > > -* moving on. > > +* Ensure there are no timeout work in progress prior to forcefully > > +* disabling the queue. There is no harm in disabling the device even > > +* when it was already disabled, as this will forcefully reclaim any > > +* IOs that are stuck due to blk-mq's timeout handling that prevents > > +* timed out requests from completing. > > */ > > - if (dev->ctrl.ctrl_config & NVME_CC_ENABLE) > > - nvme_dev_disable(dev, false); > > + nvme_sync_queues(>ctrl); > > + nvme_dev_disable(dev, false); > > That may not work reliably too. > > For example, request A from NS_0 is timed-out and handled as RESET_TIMER, > meantime request B from NS_1 is timed-out and handled as EH_HANDLED. Meanwhile, request B's nvme_dev_disable prior to returning EH_HANDLED cancels all requests, which includes request A. > When the above reset work is run for handling timeout of req B, > new timeout event on request A may come just between the above > nvme_sync_queues() and nvme_dev_disable() The sync queues either stops the timer from running, or waits for it to complete. We are in the RESETTING state: if request A's timeout happens to be running, we're not restarting the timer; we're returning EH_HANDLED. > then nvme_dev_disable() > can't cover request A, and finally the timed-out event for req A > will nvme_dev_disable() when the current reset is just in-progress, > then this reset can't move on, and IO hang is caused. At no point in the nvme_reset are we waiting for any IO to complete. Reset continues to make forward progress.
Re: [PATCH 3/6] nvme: Move all IO out of controller reset
On Mon, May 21, 2018 at 11:34:27PM +0800, Ming Lei wrote: > nvme_dev_disable() quiesces queues first before killing queues. > > If queues are quiesced during or before nvme_wait_freeze() is run > from the 2nd part of reset, the 2nd part can't move on, and IO hang > is caused. Finally no reset can be scheduled at all. But this patch moves nvme_wait_freeze outside the reset path, so I'm afraid I'm unable to follow how you've concluded the wait freeze is somehow part of the reset.
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On 5/21/18 9:18 AM, Mike Snitzer wrote: > On Mon, May 21 2018 at 11:09am -0400, > Jens Axboe wrote: > >> On 5/21/18 9:04 AM, Mike Snitzer wrote: >>> On Mon, May 21 2018 at 10:52am -0400, >>> Jens Axboe wrote: >>> On 5/21/18 8:47 AM, Mike Snitzer wrote: > On Mon, May 21 2018 at 10:36am -0400, > Jens Axboe wrote: > >> On 5/21/18 8:31 AM, Mike Snitzer wrote: >>> On Mon, May 21 2018 at 10:19am -0400, >>> Jens Axboe wrote: >>> On 5/21/18 8:03 AM, Mike Snitzer wrote: > On Sun, May 20 2018 at 6:25pm -0400, > Kent Overstreet wrote: > >> Jens - this series does the rest of the conversions that Christoph >> wanted, and >> drops bioset_create(). >> >> Only lightly tested, but the changes are pretty mechanical. Based on >> your >> for-next tree. > > By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' > you've altered the alignment of members in data structures. So I'll > need to audit all the data structures you've modified in DM. > > Could we get the backstory on _why_ you're making this change? > Would go a long way to helping me appreciate why this is a good use of > anyone's time. Yeah, it's in the first series, it gets rid of a pointer indirection. >>> >>> "Allows mempools to be embedded in other structs, getting rid of a >>> pointer indirection from allocation fastpaths." >>> >>> So this is about using contiguous memory or avoiding partial allocation >>> failure? Or both? >>> >>> Or more to it? Just trying to fully appreciate the theory behind the >>> perceived associated benefit. >> >> It's about avoiding a pointer indirection. Instead of having to >> follow a pointer to get to that struct, it's simple offset math off >> your main structure. >> >>> I do think the increased risk of these embedded bio_set and mempool_t >>> themselves crossing cachelines, or struct members that follow them doing >>> so, really detracts from these types of changes. >> >> Definitely something to look out for, though most of them should be >> per-dev structures and not in-flight structures. That makes it a bit >> less sensitive. But can't hurt to audit the layouts and adjust if >> necessary. This is why it's posted for review :-) > > This isn't something that is easily caught upfront. Yes we can all be > busy little beavers with pahole to audit alignment. But chances are > most people won't do it. > > Reality is there is potential for a regression due to false sharing to > creep in if a hot struct member suddenly starts straddling a cacheline. > That type of NUMA performance killer is pretty insidious and somewhat > tedious to hunt down even when looking for it with specialized tools: > https://joemario.github.io/blog/2016/09/01/c2c-blog/ IMHO you're making a big deal out of something that should not be. >>> >>> I raised an issue that had seemingly not been considered at all. Not >>> making a big deal. Raising it for others' benefit. >>> If the dm bits are that sensitive and cache line honed to perfection already due to previous regressions in that area, then it might not be a bad idea to have some compile checks for false cacheline sharing between sensitive members, or spilling of a sub-struct into multiple cachelines. It's not like this was pushed behind your back. It's posted for review. It's quite possible the net change is a win for dm. Let's focus on getting it reviewed, rather than pontificate on what could potentially go all wrong with this. >>> >>> Why are you making this personal?
I'm merely >>> pointing out this change isn't something that can be given a quick >>> blanket "looks good". >> >> I'm not making this personal at all?! You raised a (valid) concern, >> I'm merely stating why I don't think it's a high risk issue. I'm >> assuming your worry is related to dm, as those are the reports >> that would ultimately land on your desk. > > Then we'll just agree to disagree with what this implies: "It's not like > this was pushed behind your back." I'm afraid you've lost me now - it was not pushed behind your back, it was posted for review, with you on the CC list. Not trying to be deliberately dense here, I just don't see what our disagreement is. > Reality is I'm fine with the change. Just think there is follow-on work > (now or later) that is needed. It's not hard to run the quick struct layout and alignment checks. If there's a concern, that should be done now, instead of being deferred to later. There's no point merging something that we expect to have follow-on work. If that's the
Re: [PATCH v11 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64
On Fri, 2018-05-18 at 11:32 -0700, h...@zytor.com wrote:
> On May 18, 2018 11:00:05 AM PDT, Bart Van Assche wrote:
> > The next patch in this series introduces a call to cmpxchg64()
> > in the block layer core for those architectures on which this
> > functionality is available. Make it possible to test whether
> > cmpxchg64() is available by introducing CONFIG_ARCH_HAVE_CMPXCHG64.
> >
> > Signed-off-by: Bart Van Assche
> > Cc: Catalin Marinas
> > Cc: Will Deacon
> > Cc: Tony Luck
> > Cc: Fenghua Yu
> > Cc: Geert Uytterhoeven
> > Cc: "James E.J. Bottomley"
> > Cc: Helge Deller
> > Cc: Benjamin Herrenschmidt
> > Cc: Paul Mackerras
> > Cc: Michael Ellerman
> > Cc: Martin Schwidefsky
> > Cc: Heiko Carstens
> > Cc: David S. Miller
> > Cc: Thomas Gleixner
> > Cc: Ingo Molnar
> > Cc: H. Peter Anvin
> > Cc: Chris Zankel
> > Cc: Max Filippov
> > Cc: Arnd Bergmann
> > Cc: Jonathan Corbet
> > ---
> >  .../features/locking/cmpxchg64/arch-support.txt | 33 ++
> >  arch/Kconfig        | 4 +++
> >  arch/arm/Kconfig    | 1 +
> >  arch/ia64/Kconfig   | 1 +
> >  arch/m68k/Kconfig   | 1 +
> >  arch/mips/Kconfig   | 1 +
> >  arch/parisc/Kconfig | 1 +
> >  arch/riscv/Kconfig  | 1 +
> >  arch/sparc/Kconfig  | 1 +
> >  arch/x86/Kconfig    | 1 +
> >  arch/xtensa/Kconfig | 1 +
> >  11 files changed, 46 insertions(+)
> >  create mode 100644 Documentation/features/locking/cmpxchg64/arch-support.txt
> >
> > diff --git a/Documentation/features/locking/cmpxchg64/arch-support.txt b/Documentation/features/locking/cmpxchg64/arch-support.txt
> > new file mode 100644
> > index ..84bfef7242b2
> > --- /dev/null
> > +++ b/Documentation/features/locking/cmpxchg64/arch-support.txt
> > @@ -0,0 +1,33 @@
> > +#
> > +# Feature name: cmpxchg64
> > +# Kconfig: ARCH_HAVE_CMPXCHG64
> > +# description: arch supports the cmpxchg64() API
> > +#
> > +---
> > +| arch |status|
> > +---
> > +| alpha: | ok |
> > +| arc: | .. |
> > +| arm: | ok |
> > +| arm64: | ok |
> > +| c6x: | .. |
> > +| h8300: | .. |
> > +| hexagon: | .. |
> > +| ia64: | ok |
> > +| m68k: | ok |
> > +| microblaze: | .. |
> > +| mips: | ok |
> > +| nds32: | .. |
> > +| nios2: | .. |
> > +| openrisc: | .. |
> > +| parisc: | ok |
> > +| powerpc: | ok |
> > +| riscv: | ok |
> > +| s390: | ok |
> > +| sh: | .. |
> > +| sparc: | ok |
> > +| um: | .. |
> > +| unicore32: | .. |
> > +| x86: | ok |
> > +| xtensa: | ok |
> > +---
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index 8e0d665c8d53..9840b2577af1 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -358,6 +358,10 @@ config HAVE_ALIGNED_STRUCT_PAGE
> >  	  on a struct page for better performance. However selecting this
> >  	  might increase the size of a struct page by a word.
> >
> > +config ARCH_HAVE_CMPXCHG64
> > +	bool
> > +	default y if 64BIT
> > +
> >  config HAVE_CMPXCHG_LOCAL
> >  	bool
> >
> > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> > index a7f8e7f4b88f..02c75697176e 100644
> > --- a/arch/arm/Kconfig
> > +++ b/arch/arm/Kconfig
> > @@ -13,6 +13,7 @@ config ARM
> >  	select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
> >  	select ARCH_HAS_STRICT_MODULE_RWX if MMU
> >  	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
> > +	select ARCH_HAVE_CMPXCHG64 if !THUMB2_KERNEL
> >  	select ARCH_HAVE_CUSTOM_GPIO_H
> >  	select ARCH_HAS_GCOV_PROFILE_ALL
> >  	select ARCH_MIGHT_HAVE_PC_PARPORT
> > diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
> > index bbe12a038d21..31c49e1482e2 100644
> > --- a/arch/ia64/Kconfig
> > +++ b/arch/ia64/Kconfig
> > @@ -41,6 +41,7 @@ config IA64
> >  	select GENERIC_PENDING_IRQ if SMP
> >  	select GENERIC_IRQ_SHOW
> >  	select GENERIC_IRQ_LEGACY
> > +	select ARCH_HAVE_CMPXCHG64
> >  	select ARCH_HAVE_NMI_SAFE_CMPXCHG
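To make the new symbol's intended use concrete, a hedged sketch of the kind of guarded call site the follow-up block-layer patch would introduce; the function and field names here are illustrative, not the actual patch 2/2:

	#include <linux/atomic.h>

	#ifdef CONFIG_ARCH_HAVE_CMPXCHG64
	/* lockless 64-bit publish; slot and stamp are illustrative names */
	static bool demo_record(u64 *slot, u64 stamp)
	{
		u64 old = READ_ONCE(*slot);

		/* cmpxchg64() returns the previous value; equal means we won */
		return cmpxchg64(slot, old, stamp) == old;
	}
	#else
	/* without cmpxchg64() the caller must fall back to a lock */
	#endif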
Re: [PATCH 3/6] nvme: Move all IO out of controller reset
On Mon, May 21, 2018 at 09:03:31AM -0600, Keith Busch wrote: > On Mon, May 21, 2018 at 10:58:51PM +0800, Ming Lei wrote: > > On Mon, May 21, 2018 at 08:22:19AM -0600, Keith Busch wrote: > > > On Sat, May 19, 2018 at 07:03:58AM +0800, Ming Lei wrote: > > > > On Fri, May 18, 2018 at 10:38:20AM -0600, Keith Busch wrote: > > > > > + > > > > > + if (unfreeze) > > > > > + nvme_wait_freeze(&dev->ctrl); > > > > > + > > > > > > > > a timeout may come just before blk_mq_update_nr_hw_queues() or > > > > the above nvme_wait_freeze(), then both may hang forever. > > > > > > Why would it hang forever? The scan_work doesn't stop a timeout from > > > triggering a reset to reclaim requests necessary to complete a freeze. > > > > nvme_dev_disable() will quiesce queues, then nvme_wait_freeze() or > > blk_mq_update_nr_hw_queues() may hang forever. > > nvme_dev_disable is just the first part of the timeout sequence. You > have to follow it through to the reset_work that either restarts or > kills the queues. nvme_dev_disable() quiesces queues first before killing queues. If queues are quiesced during or before nvme_wait_freeze() is run from the 2nd part of reset, the 2nd part can't move on, and IO hang is caused. Finally no reset can be scheduled at all. Thanks, Ming
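The interaction Ming is pointing at can be reduced to the generic blk-mq primitives. A sketch of the hang using the stock API calls, not the actual nvme call chain:

	#include <linux/blk-mq.h>

	static void demo_freeze_vs_quiesce(struct request_queue *q)
	{
		blk_freeze_queue_start(q);	/* new submitters block at queue entry */
		blk_mq_quiesce_queue(q);	/* dispatch of queued requests stops  */

		/*
		 * Freezing finishes only when every request that entered the
		 * queue completes. A quiesced queue dispatches nothing, so
		 * already-queued requests never complete and this wait never
		 * returns unless something unquiesces the queue or forcefully
		 * ends the requests.
		 */
		blk_mq_freeze_queue_wait(q);

		blk_mq_unquiesce_queue(q);
		blk_mq_unfreeze_queue(q);
	}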
Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
On Mon, May 21, 2018 at 08:16:59AM -0700, Linus Torvalds wrote: > On Mon, May 21, 2018 at 6:51 AM Roman Penyaev <roman.peny...@profitbricks.com> wrote: > > > No, I continue from the pointer, which I assigned on the previous IO > > in order to send IO fairly and keep load balanced. > > Right. And that's exactly what has both me and Paul nervous. You're no > longer in the RCU domain. You're using a pointer where the lifetime has > nothing to do with RCU any more. > > Can it be done? Sure. But you need *other* locking for it (that you haven't > explained), and it's fragile as hell. He looks to actually have it right, but I would want to see a big comment on the read side noting the leak of the pointer and documenting why it is OK. Thanx, Paul > It's probably best to not use RCU for it at all, but depend on that "other > locking" that you have to have anyway, to keep the pointer valid over the > non-RCU region. > > Linus >
Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
On Mon, May 21, 2018 at 03:50:10PM +0200, Roman Penyaev wrote:
> On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney wrote:
> > On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
> >> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney wrote:
> >> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
> >> >> Function is going to be used in transport over RDMA module
> >> >> in subsequent patches.
> >> >>
> >> >> Function returns next element in round-robin fashion,
> >> >> i.e. head will be skipped. NULL will be returned if list
> >> >> is observed as empty.
> >> >>
> >> >> Signed-off-by: Roman Pen
> >> >> Cc: Paul E. McKenney
> >> >> Cc: linux-ker...@vger.kernel.org
> >> >> ---
> >> >>  include/linux/rculist.h | 19 +++
> >> >>  1 file changed, 19 insertions(+)
> >> >>
> >> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> >> >> index 127f534fec94..b0840d5ab25a 100644
> >> >> --- a/include/linux/rculist.h
> >> >> +++ b/include/linux/rculist.h
> >> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
> >> >>  })
> >> >>
> >> >>  /**
> >> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
> >> >> + * @head: the head for the list.
> >> >> + * @ptr: the list head to take the next element from.
> >> >> + * @type: the type of the struct this is embedded in.
> >> >> + * @memb: the name of the list_head within the struct.
> >> >> + *
> >> >> + * Next element returned in round-robin fashion, i.e. head will be skipped,
> >> >> + * but if list is observed as empty, NULL will be returned.
> >> >> + *
> >> >> + * This primitive may safely run concurrently with the _rcu list-mutation
> >> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
> >> >
> >> > Of course, all the set of list_next_or_null_rr_rcu() invocations that
> >> > are round-robining a given list must all be under the same RCU read-side
> >> > critical section. For example, the following will break badly:
> >> >
> >> >	struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
> >> >	{
> >> >		struct foo *ret;
> >> >
> >> >		rcu_read_lock();
> >> >		ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
> >> >		rcu_read_unlock();  /* BUG */
> >> >		return ret;
> >> >	}
> >> >
> >> > You need a big fat comment stating this, at the very least. The resulting
> >> > bug can be very hard to trigger and even harder to debug.
> >> >
> >> > And yes, I know that the same restriction applies to list_next_rcu()
> >> > and friends. The difference is that if you try to invoke those in an
> >> > infinite loop, you will be rapped on the knuckles as soon as you hit
> >> > the list header. Without that knuckle-rapping, RCU CPU stall warnings
> >> > might tempt people to do something broken like take_rr_step() above.
> >>
> >> Hi Paul,
> >>
> >> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
> >> connection from the list in order to send a request, i.e. my code is
> >> something like the following:
> >>
> >>	static struct conn *get_and_set_next_conn(void)
> >>	{
> >>		struct conn *conn;
> >>
> >>		conn = rcu_dereference(rcu_conn);
> >>		if (unlikely(!conn))
> >>			return conn;
> >
> > Wait. Don't you need to restart from the beginning of the list in
> > this case? Or does the list never have anything added to it and is
> > rcu_conn initially the first element in the list?
>
> Hi Paul,
>
> No, I continue from the pointer, which I assigned on the previous IO
> in order to send IO fairly and keep load balanced.
>
> Initially @rcu_conn points to the first element, but elements can
> be deleted from the list and list can become empty.
>
> The deletion code is below.
>
> >
> >>		conn = list_next_or_null_rr_rcu(&conn_list,
> >>						&conn->entry,
> >>						typeof(*conn),
> >>						entry);
> >>		rcu_assign_pointer(rcu_conn, conn);
> >
> > Linus is correct to doubt this code. You assign a pointer to the current
> > element to rcu_conn, which is presumably a per-CPU or global variable.
> > So far, so good ...
>
> I use per-CPU, in the first example I did not show that, not to overcomplicate
> the code.
>
> >
> >>		return conn;
> >>	}
> >>
> >>	rcu_read_lock();
> >>	conn = get_and_set_next_conn();
> >>	if (unlikely(!conn)) {
> >>		/* ... */
> >>	}
> >>	err = rdma_io(conn,
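The macro body itself is clipped from the quoted diff; one plausible implementation on top of the existing list_next_or_null_rcu() looks like the following. This is an approximation for readers, not necessarily the exact patch text:

	#include <linux/rculist.h>

	/*
	 * Approximation of the helper under discussion: return the element
	 * after @ptr, skipping @head; NULL only if the list is observed empty.
	 */
	#define list_next_or_null_rr_rcu(head, ptr, type, memb)			\
	({									\
		/* next element, or NULL when @ptr is the last one ... */	\
		list_next_or_null_rcu(head, ptr, type, memb) ?:			\
			/* ... in which case wrap past @head to the first */	\
			list_next_or_null_rcu(head, READ_ONCE((ptr)->next),	\
					      type, memb);			\
	})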
Re: [PATCH 1/6] nvme: Sync request queues on reset
On Mon, May 21, 2018 at 08:04:13AM -0600, Keith Busch wrote:
> On Sat, May 19, 2018 at 08:01:42AM +0800, Ming Lei wrote:
> > > You keep saying that, but the controller state is global to the
> > > controller. It doesn't matter which namespace request_queue started the
> > > reset: every namespace's request queue sees the RESETTING controller state
> >
> > When timeouts come, the global state of RESETTING may not be updated
> > yet, so all the timeouts may not observe the state.
>
> Even prior to the RESETTING state, every single command, no matter
> which namespace or request_queue it came on, is reclaimed by the driver.
> There *should* be no requests to timeout after nvme_dev_disable is called
> because the nvme driver returned control of all requests in the tagset
> to blk-mq.

The timed-out requests won't be canceled by nvme_dev_disable().

If a timed-out request is handled as RESET_TIMER, there may
be a new timeout event triggered again.

> In any case, if blk-mq decides it won't complete those requests, we
> can just swap the order in the reset_work: sync first, unconditionally
> disable. Does the following snippet look more okay?
>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 17a0190bd88f..42af077ee07a 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2307,11 +2307,14 @@ static void nvme_reset_work(struct work_struct *work)
>  		goto out;
>
>  	/*
> -	 * If we're called to reset a live controller first shut it down before
> -	 * moving on.
> +	 * Ensure there are no timeout work in progress prior to forcefully
> +	 * disabling the queue. There is no harm in disabling the device even
> +	 * when it was already disabled, as this will forcefully reclaim any
> +	 * IOs that are stuck due to blk-mq's timeout handling that prevents
> +	 * timed out requests from completing.
>  	 */
> -	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
> -		nvme_dev_disable(dev, false);
> +	nvme_sync_queues(&dev->ctrl);
> +	nvme_dev_disable(dev, false);

That may not work reliably either.

For example, request A from NS_0 is timed-out and handled as RESET_TIMER, meantime request B from NS_1 is timed-out and handled as EH_HANDLED.

When the above reset work is run for handling the timeout of req B, a new timeout event on request A may come just between the above nvme_sync_queues() and nvme_dev_disable(), then nvme_dev_disable() can't cover request A, and finally the timed-out event for req A will nvme_dev_disable() when the current reset is just in-progress, then this reset can't move on, and IO hang is caused.

Thanks,
Ming
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On 5/21/18 9:12 AM, David Sterba wrote: > On Mon, May 21, 2018 at 08:19:58AM -0600, Jens Axboe wrote: >> On 5/21/18 8:03 AM, Mike Snitzer wrote: >>> On Sun, May 20 2018 at 6:25pm -0400, >>> Kent Overstreet wrote: >>> Jens - this series does the rest of the conversions that Christoph wanted, and drops bioset_create(). Only lightly tested, but the changes are pretty mechanical. Based on your for-next tree. >>> >>> By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' >>> you've altered the alignment of members in data structures. So I'll >>> need to audit all the data structures you've modified in DM. >>> >>> Could we get the backstory on _why_ you're making this change? >>> Would go a long way to helping me appreciate why this is a good use of >>> anyone's time. >> >> Yeah, it's in the first series, it gets rid of a pointer indirection. > > This should also be mentioned in the changelog of each patch. There are > 12 subsystems changed, this could be about 10 maintainers, and I guess > everybody has the same question why the change is made. Agree, the justification should be in this series as well, of course. Kent, might not be a bad idea to resend with a more descriptive cover letter. -- Jens Axboe
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21 2018 at 11:09am -0400, Jens Axboe wrote: > On 5/21/18 9:04 AM, Mike Snitzer wrote: > > On Mon, May 21 2018 at 10:52am -0400, > > Jens Axboe wrote: > > > >> On 5/21/18 8:47 AM, Mike Snitzer wrote: > >>> On Mon, May 21 2018 at 10:36am -0400, > >>> Jens Axboe wrote: > >>> > On 5/21/18 8:31 AM, Mike Snitzer wrote: > > On Mon, May 21 2018 at 10:19am -0400, > > Jens Axboe wrote: > > > >> On 5/21/18 8:03 AM, Mike Snitzer wrote: > >>> On Sun, May 20 2018 at 6:25pm -0400, > >>> Kent Overstreet wrote: > >>> > Jens - this series does the rest of the conversions that Christoph > wanted, and > drops bioset_create(). > > Only lightly tested, but the changes are pretty mechanical. Based on > your > for-next tree. > >>> > >>> By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' > >>> you've altered the alignment of members in data structures. So I'll > >>> need to audit all the data structures you've modified in DM. > >>> > >>> Could we get the backstory on _why_ you're making this change? > >>> Would go a long way to helping me appreciate why this is a good use of > >>> anyone's time. > >> > >> Yeah, it's in the first series, it gets rid of a pointer indirection. > > > > "Allows mempools to be embedded in other structs, getting rid of a > > pointer indirection from allocation fastpaths." > > > > So this is about using contiguous memory or avoiding partial allocation > > failure? Or both? > > > > Or more to it? Just trying to fully appreciate the theory behind the > > perceived associated benefit. > > It's about avoiding a pointer indirection. Instead of having to > follow a pointer to get to that struct, it's simple offset math off > your main structure. > > > I do think the increased risk of these embedded bio_set and mempool_t > > themselves crossing cachelines, or struct members that follow them doing > > so, really detracts from these types of changes. > > Definitely something to look out for, though most of them should be > per-dev structures and not in-flight structures. That makes it a bit > less sensitive. But can't hurt to audit the layouts and adjust if > necessary. This is why it's posted for review :-) > >>> > >>> This isn't something that is easily caught upfront. Yes we can all be > >>> busy little beavers with pahole to audit alignment. But chances are > >>> most people won't do it. > >>> > >>> Reality is there is potential for a regression due to false sharing to > >>> creep in if a hot struct member suddenly starts straddling a cacheline. > >>> That type of NUMA performance killer is pretty insidious and somewhat > >>> tedious to hunt down even when looking for it with specialized tools: > >>> https://joemario.github.io/blog/2016/09/01/c2c-blog/ > >> > >> IMHO you're making a big deal out of something that should not be. > > > > I raised an issue that had seemingly not been considered at all. Not > > making a big deal. Raising it for others' benefit. > > > >> If the dm bits are that sensitive and cache line honed to perfection > >> already due to previous regressions in that area, then it might > >> not be a bad idea to have some compile checks for false cacheline > >> sharing between sensitive members, or spilling of a sub-struct > >> into multiple cachelines. > >> > >> It's not like this was pushed behind your back. It's posted for > >> review. It's quite possible the net change is a win for dm. Let's > >> focus on getting it reviewed, rather than pontificate on what > >> could potentially go all wrong with this. > > > > Why are you making this personal?
Or purely about DM? I'm merely > > pointing out this change isn't something that can be given a quick > > blanket "looks good". > > I'm not making this personal at all?! You raised a (valid) concern, > I'm merely stating why I don't think it's a high risk issue. I'm > assuming your worry is related to dm, as those are the reports > that would ultimately land on your desk. Then we'll just agree to disagree with what this implies: "It's not like this was pushed behind your back." Reality is I'm fine with the change. Just think there is follow-on work (now or later) that is needed. Enough said.
Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
On Mon, May 21, 2018 at 6:51 AM Roman Penyaev <roman.peny...@profitbricks.com> wrote: > No, I continue from the pointer, which I assigned on the previous IO > in order to send IO fairly and keep load balanced. Right. And that's exactly what has both me and Paul nervous. You're no longer in the RCU domain. You're using a pointer where the lifetime has nothing to do with RCU any more. Can it be done? Sure. But you need *other* locking for it (that you haven't explained), and it's fragile as hell. It's probably best to not use RCU for it at all, but depend on that "other locking" that you have to have anyway, to keep the pointer valid over the non-RCU region. Linus
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21, 2018 at 08:19:58AM -0600, Jens Axboe wrote: > On 5/21/18 8:03 AM, Mike Snitzer wrote: > > On Sun, May 20 2018 at 6:25pm -0400, > > Kent Overstreet wrote: > > > >> Jens - this series does the rest of the conversions that Christoph wanted, > >> and > >> drops bioset_create(). > >> > >> Only lightly tested, but the changes are pretty mechanical. Based on your > >> for-next tree. > > > > By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' > > you've altered the alignment of members in data structures. So I'll > > need to audit all the data structures you've modified in DM. > > > > Could we get the backstory on _why_ you're making this change? > > Would go a long way to helping me appreciate why this is a good use of > > anyone's time. > > Yeah, it's in the first series, it gets rid of a pointer indirection. This should also be mentioned in the changelog of each patch. There are 12 subsystems changed, this could be about 10 maintainers, and I guess everybody has the same question why the change is made. The conversion is not exactly the same in all patches; the simple pointer -> static variable change could possibly be covered by the same generic text, but as Mike points out, the alignment changes should at least be mentioned for consideration otherwise.
Re: [PATCH 00/10] Misc block layer patches for bcachefs
On Sun, 2018-05-20 at 19:58 -0400, Kent Overstreet wrote: > On Sun, May 20, 2018 at 11:40:45PM +0000, Bart Van Assche wrote: > > On Sun, 2018-05-20 at 19:21 -0400, Kent Overstreet wrote: > > > I really have better things to do than debug someone else's tests... > > > [ ... ] > > > ../run_tests: line 65: cd: /lib/modules/4.16.0+/kernel/block: No such > > > file or directory > > > > Kernel v4.16 is too old to run these tests. The srp-test script needs the > > following commit that went upstream in kernel v4.17-rc1: > > > > 63cf1a902c9d ("IB/srpt: Add RDMA/CM support") > > Same output on Jens' for-next branch. Others have been able to run the srp-test software with the instructions provided earlier in this e-mail thread. Can you share the kernel messages from around the time the test was run (dmesg, /var/log/messages or /var/log/syslog)? Thanks, Bart.
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On 5/21/18 9:04 AM, Mike Snitzer wrote: > On Mon, May 21 2018 at 10:52am -0400, > Jens Axboe wrote: > >> On 5/21/18 8:47 AM, Mike Snitzer wrote: >>> On Mon, May 21 2018 at 10:36am -0400, >>> Jens Axboe wrote: >>> On 5/21/18 8:31 AM, Mike Snitzer wrote: > On Mon, May 21 2018 at 10:19am -0400, > Jens Axboe wrote: > >> On 5/21/18 8:03 AM, Mike Snitzer wrote: >>> On Sun, May 20 2018 at 6:25pm -0400, >>> Kent Overstreet wrote: >>> Jens - this series does the rest of the conversions that Christoph wanted, and drops bioset_create(). Only lightly tested, but the changes are pretty mechanical. Based on your for-next tree. >>> >>> By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' >>> you've altered the alignment of members in data structures. So I'll >>> need to audit all the data structures you've modified in DM. >>> >>> Could we get the backstory on _why_ you're making this change? >>> Would go a long way to helping me appreciate why this is a good use of >>> anyone's time. >> >> Yeah, it's in the first series, it gets rid of a pointer indirection. > > "Allows mempools to be embedded in other structs, getting rid of a > pointer indirection from allocation fastpaths." > > So this is about using contiguous memory or avoiding partial allocation > failure? Or both? > > Or more to it? Just trying to fully appreciate the theory behind the > perceived associated benefit. It's about avoiding a pointer indirection. Instead of having to follow a pointer to get to that struct, it's simple offset math off your main structure. > I do think the increased risk of these embedded bio_set and mempool_t > themselves crossing cachelines, or struct members that follow them doing > so, really detracts from these types of changes. Definitely something to look out for, though most of them should be per-dev structures and not in-flight structures. That makes it a bit less sensitive. But can't hurt to audit the layouts and adjust if necessary. This is why it's posted for review :-) >>> >>> This isn't something that is easily caught upfront. Yes we can all be >>> busy little beavers with pahole to audit alignment. But chances are >>> most people won't do it. >>> >>> Reality is there is potential for a regression due to false sharing to >>> creep in if a hot struct member suddenly starts straddling a cacheline. >>> That type of NUMA performance killer is pretty insidious and somewhat >>> tedious to hunt down even when looking for it with specialized tools: >>> https://joemario.github.io/blog/2016/09/01/c2c-blog/ >> >> IMHO you're making a big deal out of something that should not be. > > I raised an issue that had seemingly not been considered at all. Not > making a big deal. Raising it for others' benefit. > >> If the dm bits are that sensitive and cache line honed to perfection >> already due to previous regressions in that area, then it might >> not be a bad idea to have some compile checks for false cacheline >> sharing between sensitive members, or spilling of a sub-struct >> into multiple cachelines. >> >> It's not like this was pushed behind your back. It's posted for >> review. It's quite possible the net change is a win for dm. Let's >> focus on getting it reviewed, rather than pontificate on what >> could potentially go all wrong with this. > > Why are you making this personal? Or purely about DM? I'm merely > pointing out this change isn't something that can be given a quick > blanket "looks good". I'm not making this personal at all?! You raised a (valid) concern, I'm merely stating why I don't think it's a high risk issue. I'm assuming your worry is related to dm, as those are the reports that would ultimately land on your desk. -- Jens Axboe
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21 2018 at 10:52am -0400, Jens Axboe wrote: > On 5/21/18 8:47 AM, Mike Snitzer wrote: > > On Mon, May 21 2018 at 10:36am -0400, > > Jens Axboe wrote: > > > >> On 5/21/18 8:31 AM, Mike Snitzer wrote: > >>> On Mon, May 21 2018 at 10:19am -0400, > >>> Jens Axboe wrote: > >>> > On 5/21/18 8:03 AM, Mike Snitzer wrote: > > On Sun, May 20 2018 at 6:25pm -0400, > > Kent Overstreet wrote: > > > >> Jens - this series does the rest of the conversions that Christoph > >> wanted, and > >> drops bioset_create(). > >> > >> Only lightly tested, but the changes are pretty mechanical. Based on > >> your > >> for-next tree. > > > > By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' > > you've altered the alignment of members in data structures. So I'll > > need to audit all the data structures you've modified in DM. > > > > Could we get the backstory on _why_ you're making this change? > > Would go a long way to helping me appreciate why this is a good use of > > anyone's time. > > Yeah, it's in the first series, it gets rid of a pointer indirection. > >>> > >>> "Allows mempools to be embedded in other structs, getting rid of a > >>> pointer indirection from allocation fastpaths." > >>> > >>> So this is about using contiguous memory or avoiding partial allocation > >>> failure? Or both? > >>> > >>> Or more to it? Just trying to fully appreciate the theory behind the > >>> perceived associated benefit. > >> > >> It's about avoiding a pointer indirection. Instead of having to > >> follow a pointer to get to that struct, it's simple offset math off > >> your main structure. > >> > >>> I do think the increased risk of these embedded bio_set and mempool_t > >>> themselves crossing cachelines, or struct members that follow them doing > >>> so, really detracts from these types of changes. > >> > >> Definitely something to look out for, though most of them should be > >> per-dev structures and not in-flight structures. That makes it a bit > >> less sensitive. But can't hurt to audit the layouts and adjust if > >> necessary. This is why it's posted for review :-) > > > > This isn't something that is easily caught upfront. Yes we can all be > > busy little beavers with pahole to audit alignment. But chances are > > most people won't do it. > > > > Reality is there is potential for a regression due to false sharing to > > creep in if a hot struct member suddenly starts straddling a cacheline. > > That type of NUMA performance killer is pretty insidious and somewhat > > tedious to hunt down even when looking for it with specialized tools: > > https://joemario.github.io/blog/2016/09/01/c2c-blog/ > > IMHO you're making a big deal out of something that should not be. I raised an issue that had seemingly not been considered at all. Not making a big deal. Raising it for others' benefit. > If the dm bits are that sensitive and cache line honed to perfection > already due to previous regressions in that area, then it might > not be a bad idea to have some compile checks for false cacheline > sharing between sensitive members, or spilling of a sub-struct > into multiple cachelines. > > It's not like this was pushed behind your back. It's posted for > review. It's quite possible the net change is a win for dm. Let's > focus on getting it reviewed, rather than pontificate on what > could potentially go all wrong with this. Why are you making this personal? Or purely about DM? I'm merely pointing out this change isn't something that can be given a quick blanket "looks good". Mike
Re: [PATCH 3/6] nvme: Move all IO out of controller reset
On Mon, May 21, 2018 at 10:58:51PM +0800, Ming Lei wrote: > On Mon, May 21, 2018 at 08:22:19AM -0600, Keith Busch wrote: > > On Sat, May 19, 2018 at 07:03:58AM +0800, Ming Lei wrote: > > > On Fri, May 18, 2018 at 10:38:20AM -0600, Keith Busch wrote: > > > > + > > > > + if (unfreeze) > > > > + nvme_wait_freeze(&dev->ctrl); > > > > + > > > > > > a timeout may come just before blk_mq_update_nr_hw_queues() or > > > the above nvme_wait_freeze(), then both may hang forever. > > > > Why would it hang forever? The scan_work doesn't stop a timeout from > > triggering a reset to reclaim requests necessary to complete a freeze. > > nvme_dev_disable() will quiesce queues, then nvme_wait_freeze() or > blk_mq_update_nr_hw_queues() may hang forever. nvme_dev_disable is just the first part of the timeout sequence. You have to follow it through to the reset_work that either restarts or kills the queues.
Re: [PATCH 3/6] nvme: Move all IO out of controller reset
On Mon, May 21, 2018 at 08:22:19AM -0600, Keith Busch wrote: > On Sat, May 19, 2018 at 07:03:58AM +0800, Ming Lei wrote: > > On Fri, May 18, 2018 at 10:38:20AM -0600, Keith Busch wrote: > > > + > > > + if (unfreeze) > > > + nvme_wait_freeze(&dev->ctrl); > > > + > > > > a timeout may come just before blk_mq_update_nr_hw_queues() or > > the above nvme_wait_freeze(), then both may hang forever. > > Why would it hang forever? The scan_work doesn't stop a timeout from > triggering a reset to reclaim requests necessary to complete a freeze. nvme_dev_disable() will quiesce queues, then nvme_wait_freeze() or blk_mq_update_nr_hw_queues() may hang forever. Thanks, Ming
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On 5/21/18 8:47 AM, Mike Snitzer wrote: > On Mon, May 21 2018 at 10:36am -0400, > Jens Axboe wrote: > >> On 5/21/18 8:31 AM, Mike Snitzer wrote: >>> On Mon, May 21 2018 at 10:19am -0400, >>> Jens Axboe wrote: >>> On 5/21/18 8:03 AM, Mike Snitzer wrote: > On Sun, May 20 2018 at 6:25pm -0400, > Kent Overstreet wrote: > >> Jens - this series does the rest of the conversions that Christoph >> wanted, and >> drops bioset_create(). >> >> Only lightly tested, but the changes are pretty mechanical. Based on your >> for-next tree. > > By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' > you've altered the alignment of members in data structures. So I'll > need to audit all the data structures you've modified in DM. > > Could we get the backstory on _why_ you're making this change? > Would go a long way to helping me appreciate why this is a good use of > anyone's time. Yeah, it's in the first series, it gets rid of a pointer indirection. >>> >>> "Allows mempools to be embedded in other structs, getting rid of a >>> pointer indirection from allocation fastpaths." >>> >>> So this is about using contiguous memory or avoiding partial allocation >>> failure? Or both? >>> >>> Or more to it? Just trying to fully appreciate the theory behind the >>> perceived associated benefit. >> >> It's about avoiding a pointer indirection. Instead of having to >> follow a pointer to get to that struct, it's simple offset math off >> your main structure. >> >>> I do think the increased risk of these embedded bio_set and mempool_t >>> themselves crossing cachelines, or struct members that follow them doing >>> so, really detracts from these types of changes. >> >> Definitely something to look out for, though most of them should be >> per-dev structures and not in-flight structures. That makes it a bit >> less sensitive. But can't hurt to audit the layouts and adjust if >> necessary. This is why it's posted for review :-) > > This isn't something that is easily caught upfront. Yes we can all be > busy little beavers with pahole to audit alignment. But chances are > most people won't do it. > > Reality is there is potential for a regression due to false sharing to > creep in if a hot struct member suddenly starts straddling a cacheline. > That type of NUMA performance killer is pretty insidious and somewhat > tedious to hunt down even when looking for it with specialized tools: > https://joemario.github.io/blog/2016/09/01/c2c-blog/ IMHO you're making a big deal out of something that should not be. If the dm bits are that sensitive and cache line honed to perfection already due to previous regressions in that area, then it might not be a bad idea to have some compile checks for false cacheline sharing between sensitive members, or spilling of a sub-struct into multiple cachelines. It's not like this was pushed behind your back. It's posted for review. It's quite possible the net change is a win for dm. Let's focus on getting it reviewed, rather than pontificate on what could potentially go all wrong with this. -- Jens Axboe
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21 2018 at 10:36am -0400, Jens Axboe wrote: > On 5/21/18 8:31 AM, Mike Snitzer wrote: > > On Mon, May 21 2018 at 10:19am -0400, > > Jens Axboe wrote: > > > >> On 5/21/18 8:03 AM, Mike Snitzer wrote: > >>> On Sun, May 20 2018 at 6:25pm -0400, > >>> Kent Overstreet wrote: > >>> > Jens - this series does the rest of the conversions that Christoph > wanted, and > drops bioset_create(). > > Only lightly tested, but the changes are pretty mechanical. Based on your > for-next tree. > >>> > >>> By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' > >>> you've altered the alignment of members in data structures. So I'll > >>> need to audit all the data structures you've modified in DM. > >>> > >>> Could we get the backstory on _why_ you're making this change? > >>> Would go a long way to helping me appreciate why this is a good use of > >>> anyone's time. > >> > >> Yeah, it's in the first series, it gets rid of a pointer indirection. > > > > "Allows mempools to be embedded in other structs, getting rid of a > > pointer indirection from allocation fastpaths." > > > > So this is about using contiguous memory or avoiding partial allocation > > failure? Or both? > > > > Or more to it? Just trying to fully appreciate the theory behind the > > perceived associated benefit. > > It's about avoiding a pointer indirection. Instead of having to > follow a pointer to get to that struct, it's simple offset math off > your main structure. > > > I do think the increased risk of these embedded bio_set and mempool_t > > themselves crossing cachelines, or struct members that follow them doing > > so, really detracts from these types of changes. > > Definitely something to look out for, though most of them should be > per-dev structures and not in-flight structures. That makes it a bit > less sensitive. But can't hurt to audit the layouts and adjust if > necessary. This is why it's posted for review :-) This isn't something that is easily caught upfront. Yes we can all be busy little beavers with pahole to audit alignment. But chances are most people won't do it. Reality is there is potential for a regression due to false sharing to creep in if a hot struct member suddenly starts straddling a cacheline. That type of NUMA performance killer is pretty insidious and somewhat tedious to hunt down even when looking for it with specialized tools: https://joemario.github.io/blog/2016/09/01/c2c-blog/
Re: [PATCH] fs: clear writeback errors in inode_init_always
On Mon, May 21, 2018 at 07:20:05AM -0400, Jeff Layton wrote: > > As a side note, it looks like __loop_update_dio will discard an error > from vfs_fsync, so certain ioctls against a loop device can cause errors > to be lost. It seems like those ought to get propagated to userland or > to the blockdev's mapping somehow. I'm not sure it's worth it. All of the ioctls that call loop_update_dio are used just to configure the loop device. In practice they are never called once the loop device is actually running. It might be a better choice to just define that errors get reset if there is any attempt to reconfigure the loop device. One of these ioctls changes the offset that a loop device maps onto the backing file. So for example, while offset 0 of the loop device might have corresponded to the beginning of the backing file, after executing the ioctl, offset 0 of the loop device now corresponds to offset 2MB of the backing store. This might happen if we are resetting the loop device to point at the first partition of a disk image, for example. I really don't think that preserving the error status when we are doing that kind of radical configuration makes sense. Or for example, when we change the block size of the loop device: if the underlying backing store is an IBM Mainframe DASD with a 2k block size, the reason why the error was signalled was because there was an attempt to write a 1k block onto a 2k block device. So again, resetting the error status of the loop device is the right thing to do. The final thing that's worth perhaps exploring is whether or not we need all of these knobs and ioctls. In particular, LOOP_CHANGE_FD is used by losetup. It exists because back in prehistory, some distribution installer needed it for some obscure corner case. If I remember correctly, it had to do with switching out the initramfs floppy disk with the root file system floppy disk. We've wasted developer bandwidth trying to deal with syzbot complaints about that ioctl, and it's not clear it was worth the effort, because it's not clear any real code actually _needs_ that ioctl. It might have been easier to comment out the ioctl, and see if anyone screamed. Given that the loop device is maintained mostly on the margins, and it's not clear that lots of complicated error handling during setup is really necessary, this is basically a plea to keep things simple, or at least no more complicated than it has to be. Cheers, - Ted
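The reconfiguration Ted describes goes through the LOOP_SET_STATUS64 path. A hedged userspace sketch of re-pointing an already-bound loop device at a partition 2MB into its image; the device path and offset are illustrative:

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/loop.h>

	static int demo_point_at_partition(void)
	{
		struct loop_info64 info;
		int lfd = open("/dev/loop0", O_RDWR);	/* hypothetical device */

		if (lfd < 0)
			return -1;
		/* read current status so we only change the offset */
		if (ioctl(lfd, LOOP_GET_STATUS64, &info) < 0)
			return -1;

		info.lo_offset = 2 * 1024 * 1024;	/* partition starts 2MB in */

		/* after this, block 0 of loop0 is block 0 of the partition */
		return ioctl(lfd, LOOP_SET_STATUS64, &info);
	}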
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On 5/21/18 8:31 AM, Mike Snitzer wrote: > On Mon, May 21 2018 at 10:19am -0400, > Jens Axboe wrote: > >> On 5/21/18 8:03 AM, Mike Snitzer wrote: >>> On Sun, May 20 2018 at 6:25pm -0400, >>> Kent Overstreet wrote: >>> Jens - this series does the rest of the conversions that Christoph wanted, and drops bioset_create(). Only lightly tested, but the changes are pretty mechanical. Based on your for-next tree. >>> >>> By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' >>> you've altered the alignment of members in data structures. So I'll >>> need to audit all the data structures you've modified in DM. >>> >>> Could we get the backstory on _why_ you're making this change? >>> Would go a long way to helping me appreciate why this is a good use of >>> anyone's time. >> >> Yeah, it's in the first series, it gets rid of a pointer indirection. > > "Allows mempools to be embedded in other structs, getting rid of a > pointer indirection from allocation fastpaths." > > So this is about using contiguous memory or avoiding partial allocation > failure? Or both? > > Or more to it? Just trying to fully appreciate the theory behind the > perceived associated benefit. It's about avoiding a pointer indirection. Instead of having to follow a pointer to get to that struct, it's simple offset math off your main structure. > I do think the increased risk of these embedded bio_set and mempool_t > themselves crossing cachelines, or struct members that follow them doing > so, really detracts from these types of changes. Definitely something to look out for, though most of them should be per-dev structures and not in-flight structures. That makes it a bit less sensitive. But can't hurt to audit the layouts and adjust if necessary. This is why it's posted for review :-) -- Jens Axboe
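To make the indirection argument concrete, a sketch of the before/after shape of the conversion for a driver-private struct; demo_dev is hypothetical, while the init/exit pairing follows the new bioset_init()/bioset_exit() API from the series:

	#include <linux/bio.h>

	/* before: one extra allocation and one pointer chase per access */
	struct demo_dev_old {
		struct bio_set *bs;		/* bs = bioset_create(...) */
	};

	/* after: embedded member, reached by offset math off the parent */
	struct demo_dev {
		struct bio_set bs;
	};

	static int demo_dev_init(struct demo_dev *dev)
	{
		/* bioset_init() sets up an embedded bio_set in place */
		return bioset_init(&dev->bs, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
	}

	static void demo_dev_exit(struct demo_dev *dev)
	{
		bioset_exit(&dev->bs);		/* teardown; no kfree() of the set */
	}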
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Mon, May 21 2018 at 10:19am -0400, Jens Axboe wrote: > On 5/21/18 8:03 AM, Mike Snitzer wrote: > > On Sun, May 20 2018 at 6:25pm -0400, > > Kent Overstreet wrote: > > > >> Jens - this series does the rest of the conversions that Christoph wanted, > >> and > >> drops bioset_create(). > >> > >> Only lightly tested, but the changes are pretty mechanical. Based on your > >> for-next tree. > > > > By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' > > you've altered the alignment of members in data structures. So I'll > > need to audit all the data structures you've modified in DM. > > > > Could we get the backstory on _why_ you're making this change? > > Would go a long way to helping me appreciate why this is a good use of > > anyone's time. > > Yeah, it's in the first series, it gets rid of a pointer indirection. "Allows mempools to be embedded in other structs, getting rid of a pointer indirection from allocation fastpaths." So this is about using contiguous memory or avoiding partial allocation failure? Or both? Or more to it? Just trying to fully appreciate the theory behind the perceived associated benefit. I do think the increased risk of these embedded bio_set and mempool_t themselves crossing cachelines, or struct members that follow them doing so, really detracts from these types of changes. Thanks, Mike
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On 5/20/18 4:25 PM, Kent Overstreet wrote: > Jens - this series does the rest of the conversions that Christoph wanted, and > drops bioset_create(). > > Only lightly tested, but the changes are pretty mechanical. Based on your > for-next tree. Looks good to me. I'll let it simmer for a bit to give folks a chance to comment on it. -- Jens Axboe
Re: [PATCH 3/6] nvme: Move all IO out of controller reset
On Sat, May 19, 2018 at 07:03:58AM +0800, Ming Lei wrote: > On Fri, May 18, 2018 at 10:38:20AM -0600, Keith Busch wrote: > > + > > + if (unfreeze) > > + nvme_wait_freeze(&dev->ctrl); > > + > > a timeout may come just before blk_mq_update_nr_hw_queues() or > the above nvme_wait_freeze(), then both may hang forever. Why would it hang forever? The scan_work doesn't stop a timeout from triggering a reset to reclaim requests necessary to complete a freeze.
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On 5/21/18 8:03 AM, Mike Snitzer wrote: > On Sun, May 20 2018 at 6:25pm -0400, > Kent Overstreet wrote: > >> Jens - this series does the rest of the conversions that Christoph wanted, >> and >> drops bioset_create(). >> >> Only lightly tested, but the changes are pretty mechanical. Based on your >> for-next tree. > > By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' > you've altered the alignment of members in data structures. So I'll > need to audit all the data structures you've modified in DM. > > Could we get the backstory on _why_ you're making this change? > Would go a long way to helping me appreciate why this is a good use of > anyone's time. Yeah, it's in the first series, it gets rid of a pointer indirection. -- Jens Axboe
Re: [PATCH blktests] Fix block/011 to not use sysfs for device disabling
On Mon, May 21, 2018 at 02:37:56AM -0400, Yi Zhang wrote:
> Hi Keith,
> I tried this patch on my R730 Server, but it led to a system hang after
> setpci, could you help check it, thanks.
>
> Console log:
> storageqe-62 login:
> Kernel 4.17.0-rc5 on an x86_64
>
> storageqe-62 login: [ 1058.118258] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
> [ 1058.118261] {1}[Hardware Error]: event severity: fatal
> [ 1058.118262] {1}[Hardware Error]: Error 0, type: fatal
> [ 1058.118265] {1}[Hardware Error]: section_type: PCIe error
> [ 1058.118266] {1}[Hardware Error]: port_type: 0, PCIe end point
> [ 1058.118267] {1}[Hardware Error]: version: 1.16
> [ 1058.118269] {1}[Hardware Error]: command: 0x0400, status: 0x0010
> [ 1058.118270] {1}[Hardware Error]: device_id: :85:00.0
> [ 1058.118271] {1}[Hardware Error]: slot: 0
> [ 1058.118271] {1}[Hardware Error]: secondary_bus: 0x00
> [ 1058.118273] {1}[Hardware Error]: vendor_id: 0x144d, device_id: 0xa821
> [ 1058.118274] {1}[Hardware Error]: class_code: 020801
> [ 1058.118275] Kernel panic - not syncing: Fatal hardware error!
> [ 1058.118301] Kernel Offset: 0x1480 from 0x8100 (relocation range: 0x8000-0xbfff)

Thanks for the notice. The test may be going too far with the config registers it's touching. Let me see if just doing the BME bit as Ming suggested fixes this.
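For reference, "just do the BME bit" amounts to clearing only Bus Master Enable in PCI_COMMAND rather than poking the wider set of config registers. A hedged kernel-side sketch of the same operation (the blktests fix itself would flip this bit via setpci from shell):

	#include <linux/pci.h>

	static void demo_toggle_bme(struct pci_dev *pdev)
	{
		u16 cmd;

		pci_read_config_word(pdev, PCI_COMMAND, &cmd);
		/* clear only Bus Master Enable; MMIO decoding stays intact */
		pci_write_config_word(pdev, PCI_COMMAND, cmd & ~PCI_COMMAND_MASTER);

		/* ... run I/O and let it time out against the disabled device ... */

		pci_write_config_word(pdev, PCI_COMMAND, cmd);	/* restore */
	}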
Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()
On Sun, May 20 2018 at 6:25pm -0400, Kent Overstreet wrote: > Jens - this series does the rest of the conversions that Christoph wanted, and > drops bioset_create(). > > Only lightly tested, but the changes are pretty mechanical. Based on your > for-next tree. By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set' you've altered the alignment of members in data structures. So I'll need to audit all the data structures you've modified in DM. Could we get the backstory on _why_ you're making this change? Would go a long way to helping me appreciate why this is a good use of anyone's time. Thanks, Mike
Re: [PATCH 1/6] nvme: Sync request queues on reset
On Sat, May 19, 2018 at 08:01:42AM +0800, Ming Lei wrote:
> > You keep saying that, but the controller state is global to the
> > controller. It doesn't matter which namespace request_queue started the
> > reset: every namespace's request queue sees the RESETTING controller state
>
> When timeouts come, the global state of RESETTING may not be updated
> yet, so all the timeouts may not observe the state.

Even prior to the RESETTING state, every single command, no matter which namespace or request_queue it came on, is reclaimed by the driver. There *should* be no requests to timeout after nvme_dev_disable is called because the nvme driver returned control of all requests in the tagset to blk-mq.

In any case, if blk-mq decides it won't complete those requests, we can just swap the order in the reset_work: sync first, unconditionally disable. Does the following snippet look more okay?

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 17a0190bd88f..42af077ee07a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2307,11 +2307,14 @@ static void nvme_reset_work(struct work_struct *work)
 		goto out;
 
 	/*
-	 * If we're called to reset a live controller first shut it down before
-	 * moving on.
+	 * Ensure there are no timeout work in progress prior to forcefully
+	 * disabling the queue. There is no harm in disabling the device even
+	 * when it was already disabled, as this will forcefully reclaim any
+	 * IOs that are stuck due to blk-mq's timeout handling that prevents
+	 * timed out requests from completing.
 	 */
-	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
-		nvme_dev_disable(dev, false);
+	nvme_sync_queues(&dev->ctrl);
+	nvme_dev_disable(dev, false);
--
Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney wrote:
> On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
>> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney wrote:
>> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
>> >> Function is going to be used in transport over RDMA module
>> >> in subsequent patches.
>> >>
>> >> Function returns next element in round-robin fashion,
>> >> i.e. head will be skipped. NULL will be returned if list
>> >> is observed as empty.
>> >>
>> >> Signed-off-by: Roman Pen
>> >> Cc: Paul E. McKenney
>> >> Cc: linux-ker...@vger.kernel.org
>> >> ---
>> >>  include/linux/rculist.h | 19 +++
>> >>  1 file changed, 19 insertions(+)
>> >>
>> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
>> >> index 127f534fec94..b0840d5ab25a 100644
>> >> --- a/include/linux/rculist.h
>> >> +++ b/include/linux/rculist.h
>> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>> >>  })
>> >>
>> >>  /**
>> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
>> >> + * @head: the head for the list.
>> >> + * @ptr: the list head to take the next element from.
>> >> + * @type: the type of the struct this is embedded in.
>> >> + * @memb: the name of the list_head within the struct.
>> >> + *
>> >> + * Next element returned in round-robin fashion, i.e. head will be skipped,
>> >> + * but if list is observed as empty, NULL will be returned.
>> >> + *
>> >> + * This primitive may safely run concurrently with the _rcu list-mutation
>> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
>> >
>> > Of course, all the set of list_next_or_null_rr_rcu() invocations that
>> > are round-robining a given list must all be under the same RCU read-side
>> > critical section. For example, the following will break badly:
>> >
>> >	struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
>> >	{
>> >		struct foo *ret;
>> >
>> >		rcu_read_lock();
>> >		ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
>> >		rcu_read_unlock();  /* BUG */
>> >		return ret;
>> >	}
>> >
>> > You need a big fat comment stating this, at the very least. The resulting
>> > bug can be very hard to trigger and even harder to debug.
>> >
>> > And yes, I know that the same restriction applies to list_next_rcu()
>> > and friends. The difference is that if you try to invoke those in an
>> > infinite loop, you will be rapped on the knuckles as soon as you hit
>> > the list header. Without that knuckle-rapping, RCU CPU stall warnings
>> > might tempt people to do something broken like take_rr_step() above.
>>
>> Hi Paul,
>>
>> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
>> connection from the list in order to send a request, i.e. my code is
>> something like the following:
>>
>>	static struct conn *get_and_set_next_conn(void)
>>	{
>>		struct conn *conn;
>>
>>		conn = rcu_dereference(rcu_conn);
>>		if (unlikely(!conn))
>>			return conn;
>
> Wait. Don't you need to restart from the beginning of the list in
> this case? Or does the list never have anything added to it and is
> rcu_conn initially the first element in the list?

Hi Paul,

No, I continue from the pointer, which I assigned on the previous IO
in order to send IO fairly and keep load balanced.

Initially @rcu_conn points to the first element, but elements can
be deleted from the list and list can become empty.

The deletion code is below.

>
>>		conn = list_next_or_null_rr_rcu(&conn_list,
>>						&conn->entry,
>>						typeof(*conn),
>>						entry);
>>		rcu_assign_pointer(rcu_conn, conn);
>
> Linus is correct to doubt this code. You assign a pointer to the current
> element to rcu_conn, which is presumably a per-CPU or global variable.
> So far, so good ...

I use per-CPU, in the first example I did not show that, not to overcomplicate
the code.

>
>>		return conn;
>>	}
>>
>>	rcu_read_lock();
>>	conn = get_and_set_next_conn();
>>	if (unlikely(!conn)) {
>>		/* ... */
>>	}
>>	err = rdma_io(conn, request);
>>	rcu_read_unlock();

> ... except that some other CPU might well remove the entry referenced by
> rcu_conn at this point. It would have to wait for a grace period (e.g.,
> synchronize_rcu()), but the current CPU has exited its RCU read-side
> critical section, and therefore is not blocking the grace period.
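The referenced deletion code is truncated by the archive. A hedged sketch of the usual shape of such a removal path, highlighting why a pointer cached outside a read-side critical section is the dangerous piece; the struct and lock names are assumptions:

	#include <linux/rculist.h>
	#include <linux/spinlock.h>
	#include <linux/slab.h>

	struct conn {
		struct list_head entry;
		/* ... */
	};

	static void demo_remove_conn(struct conn *conn, spinlock_t *lock)
	{
		spin_lock(lock);
		list_del_rcu(&conn->entry);	/* unlink under the update-side lock */
		spin_unlock(lock);

		/*
		 * Waits only for readers currently inside rcu_read_lock()
		 * sections. A per-CPU rcu_conn cached by a CPU that already
		 * did rcu_read_unlock() is NOT waited for ...
		 */
		synchronize_rcu();
		kfree(conn);	/* ... so that cached pointer now dangles */
	}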
[PATCH] loop: clear wb_err in bd_inode when detaching backing file
From: Jeff Layton

When a loop block device encounters a writeback error, that error will
get propagated to the bd_inode's wb_err field. If we then detach the
backing file from it, attach another and fsync it, we'll get back the
writeback error that we had from the previous backing file.

This is a bit of a grey area as POSIX doesn't cover loop devices, but
it is somewhat counterintuitive.

If we detach a backing file from the loopdev while there are still
unreported errors, take it as a sign that we're no longer interested in
the previous file, and clear out the wb_err in the loop blockdev.

Signed-off-by: Jeff Layton
---
 drivers/block/loop.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5d4e31655d96..55cf554bc914 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1068,6 +1068,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	if (bdev) {
 		bdput(bdev);
 		invalidate_bdev(bdev);
+		bdev->bd_inode->i_mapping->wb_err = 0;
 	}
 	set_capacity(lo->lo_disk, 0);
 	loop_sysfs_exit(lo);
-- 
2.17.0
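The counterintuitive case the changelog describes can be reproduced from
userspace along these lines. This is a hypothetical sketch, not part of
the patch: the image paths are made up, and provoking the initial
writeback failure (e.g. via dm-error or a full filesystem) is left as a
comment:

        /* Hypothetical reproduction sketch for the stale wb_err. */
        #include <fcntl.h>
        #include <linux/loop.h>
        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <unistd.h>

        int main(void)
        {
                int loopfd = open("/dev/loop0", O_RDWR);
                int img1 = open("/tmp/img1", O_RDWR);

                ioctl(loopfd, LOOP_SET_FD, img1);
                /* ... induce a writeback failure on img1 here; the error
                 * propagates into the loopdev's bd_inode wb_err ... */
                ioctl(loopfd, LOOP_CLR_FD, 0);

                int img2 = open("/tmp/img2", O_RDWR);
                ioctl(loopfd, LOOP_SET_FD, img2);

                /* Without the patch this fsync() can fail with the old
                 * error although img2 never saw a failure. */
                if (fsync(loopfd) < 0)
                        perror("fsync");
                return 0;
        }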
Re: [PATCH] fs: clear writeback errors in inode_init_always
On Sun, 2018-05-20 at 15:41 -0400, Theodore Y. Ts'o wrote:
> On Sun, May 20, 2018 at 12:20:09PM -0700, Matthew Wilcox wrote:
> > Oh!  I misunderstood.  I thought that bd_inode->i_mapping pointed to
> > lo_backing_file->f_mapping.
> >
> > So how about this to preserve the error _in the file with the error_?
> >
> >	if (bdev) {
> >		bdput(bdev);
> >		invalidate_bdev(bdev);
> > +		filp->f_mapping->wb_err = bdev->bd_inode->i_mapping->wb_err;
> > +		bdev->bd_inode->i_mapping->wb_err = 0;
> >	}
> >	set_capacity(lo->lo_disk, 0);
> >	loop_sysfs_exit(lo);
>
> I don't think it's necessary.  wb_err on the underlying file would
> have already been set when the error was set.  So if someone tried
> calling fsync(2) on the underlying file, it would have collected the
> error already.  Re-setting the error when we detach the loop device
> doesn't seem to make any sense.
>

(cc'ing linux-block)

Yes, the error would have been set on the original file already. As long
as lo_req_flush or __loop_update_dio weren't called before we detach the
device, it should still be possible to call fsync on the original fd and
get back the error.

As a side note, it looks like __loop_update_dio will discard an error
from vfs_fsync, so certain ioctls against a loop device can cause errors
to be lost. It seems like those ought to get propagated to userland or
to the blockdev's mapping somehow.
-- 
Jeff Layton
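The "would have collected the error already" reasoning rests on the
errseq_t mechanism behind wb_err: an error recorded in a mapping stays
reportable to every observer that has not sampled past it, and each
observer collects it exactly once. A minimal sketch of those semantics
using the helpers from lib/errseq.c (the demo framing is illustrative
only, not code anyone would merge):

        /* Sketch of errseq_t semantics underlying wb_err reporting. */
        #include <linux/bug.h>
        #include <linux/errseq.h>

        static void errseq_demo(void)
        {
                errseq_t wb_err = 0;          /* like mapping->wb_err */
                errseq_t fd1_seen, fd2_seen;  /* like file->f_wb_err  */

                fd1_seen = errseq_sample(&wb_err);  /* first open()  */
                fd2_seen = errseq_sample(&wb_err);  /* second open() */

                errseq_set(&wb_err, -EIO);          /* writeback fails once */

                /* Each opener collects the error exactly once on fsync... */
                WARN_ON(errseq_check_and_advance(&wb_err, &fd1_seen) != -EIO);
                WARN_ON(errseq_check_and_advance(&wb_err, &fd2_seen) != -EIO);

                /* ...and a later fsync on the same fd is clean again. */
                WARN_ON(errseq_check_and_advance(&wb_err, &fd1_seen) != 0);
        }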
Re: [PATCH] bdi: Fix oops in wb_workfn()
On Sat 19-05-18 23:27:09, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Jan Kara wrote:
> > > Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
> > > the necessary precautions against racing with bdi unregistration.
> >
> > Yes, this patch will solve NULL pointer dereference bug. But is it OK to leave
> > list_empty(&wb->work_list) == false situation? Who takes over the role of making
> > list_empty(&wb->work_list) == true?
>
> syzbot is again reporting the same NULL pointer dereference.
>
>   general protection fault in wb_workfn (2)
>   https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

Gaah... So we are still missing something.

> Didn't we overlook something obvious in commit b8b784958eccbf8f ("bdi:
> Fix oops in wb_workfn()") ?
>
> At first, I thought that that commit will solve NULL pointer dereference bug.
> But what does
>
>	if (!list_empty(&wb->work_list))
> -		mod_delayed_work(bdi_wq, &wb->dwork, 0);
> +		wb_wakeup(wb);
>	else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
>		wb_wakeup_delayed(wb);
>
> mean?
>
>	static void wb_wakeup(struct bdi_writeback *wb)
>	{
>		spin_lock_bh(&wb->work_lock);
>		if (test_bit(WB_registered, &wb->state))
>			mod_delayed_work(bdi_wq, &wb->dwork, 0);
>		spin_unlock_bh(&wb->work_lock);
>	}
>
> It means nothing but "we don't call mod_delayed_work() if WB_registered
> bit was already cleared".

Exactly.

> But if WB_registered bit is not yet cleared when we hit
> wb_wakeup_delayed() path?
>
>	void wb_wakeup_delayed(struct bdi_writeback *wb)
>	{
>		unsigned long timeout;
>
>		timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
>		spin_lock_bh(&wb->work_lock);
>		if (test_bit(WB_registered, &wb->state))
>			queue_delayed_work(bdi_wq, &wb->dwork, timeout);
>		spin_unlock_bh(&wb->work_lock);
>	}
>
> add_timer() is called because (presumably) timeout > 0. And after that
> timeout expires, __queue_work() is called even if WB_registered bit is
> already cleared before that timeout expires, isn't it?

Yes.

>	void delayed_work_timer_fn(struct timer_list *t)
>	{
>		struct delayed_work *dwork = from_timer(dwork, t, timer);
>
>		/* should have been called from irqsafe timer with irq already off */
>		__queue_work(dwork->cpu, dwork->wq, &dwork->work);
>	}
>
> Then, wb_workfn() is after all scheduled even if we check for
> WB_registered bit, isn't it?

It can be queued after WB_registered bit is cleared but it cannot be
queued after mod_delayed_work(bdi_wq, &wb->dwork, 0) has finished. That
function deletes the pending timer (the timer cannot be armed again
because WB_registered is cleared) and queues what should be the last
round of wb_workfn().

> Then, don't we need to check that
>
>	mod_delayed_work(bdi_wq, &wb->dwork, 0);
>	flush_delayed_work(&wb->dwork);
>
> is really waiting for completion? At least, shouldn't we try below debug
> output (not only for debugging this report but also generally desirable)?
>
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 7441bd9..ccec8cd 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -376,8 +376,10 @@ static void wb_shutdown(struct bdi_writeback *wb)
>	 * tells wb_workfn() that @wb is dying and its work_list needs to
>	 * be drained no matter what.
>	 */
> -	mod_delayed_work(bdi_wq, &wb->dwork, 0);
> -	flush_delayed_work(&wb->dwork);
> +	if (!mod_delayed_work(bdi_wq, &wb->dwork, 0))
> +		printk(KERN_WARNING "wb_shutdown: mod_delayed_work() failed\n");

false return from mod_delayed_work() just means that there was no timer
armed. That is a valid situation if there are no dirty data.
> +	if (!flush_delayed_work(&wb->dwork))
> +		printk(KERN_WARNING "wb_shutdown: flush_delayed_work() failed\n");

And this is valid as well (although unlikely) if the work managed to
complete on another CPU before flush_delayed_work() was called. So I
don't think your warnings will help us much. But yes, we need to debug
this somehow. For now I have no idea what could be still going wrong.

								Honza
-- 
Jan Kara
SUSE Labs, CR
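To restate in one place the return-value semantics the two of them agree
on (a summary sketch of the shutdown sequence, not a proposed change):

        /* Sketch: why false returns in wb_shutdown() are benign. */
        static void wb_shutdown_sequence(struct bdi_writeback *wb)
        {
                /* WB_registered is already clear at this point, so
                 * wb_wakeup() and wb_wakeup_delayed() can no longer
                 * re-arm the timer. */

                /* false here only means no timer/work was pending,
                 * i.e. nothing was dirty.  Otherwise the pending timer
                 * is cancelled and one final wb_workfn() is queued. */
                mod_delayed_work(bdi_wq, &wb->dwork, 0);

                /* false here means the work already finished on another
                 * CPU.  Either way, once this returns, no wb_workfn()
                 * invocation for this wb can still be running. */
                flush_delayed_work(&wb->dwork);
        }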
Re: [PATCH blktests] Fix block/011 to not use sysfs for device disabling
Hi Keith,

I tried this patch on my R730 server, but it leads to a system hang
after setpci; could you help check it? Thanks.

Console log:
storageqe-62 login: Kernel 4.17.0-rc5 on an x86_64
storageqe-62 login: [ 1058.118258] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
[ 1058.118261] {1}[Hardware Error]: event severity: fatal
[ 1058.118262] {1}[Hardware Error]:  Error 0, type: fatal
[ 1058.118265] {1}[Hardware Error]:   section_type: PCIe error
[ 1058.118266] {1}[Hardware Error]:   port_type: 0, PCIe end point
[ 1058.118267] {1}[Hardware Error]:   version: 1.16
[ 1058.118269] {1}[Hardware Error]:   command: 0x0400, status: 0x0010
[ 1058.118270] {1}[Hardware Error]:   device_id: :85:00.0
[ 1058.118271] {1}[Hardware Error]:   slot: 0
[ 1058.118271] {1}[Hardware Error]:   secondary_bus: 0x00
[ 1058.118273] {1}[Hardware Error]:   vendor_id: 0x144d, device_id: 0xa821
[ 1058.118274] {1}[Hardware Error]:   class_code: 020801
[ 1058.118275] Kernel panic - not syncing: Fatal hardware error!
[ 1058.118301] Kernel Offset: 0x1480 from 0x8100 (relocation range: 0x8000-0xbfff)

Best Regards,
Yi Zhang

----- Original Message -----
From: "Keith Busch"
To: "Omar Sandoval", linux-block@vger.kernel.org, linux-n...@lists.infradead.org
Cc: "Johannes Thumshirn", "Christoph Hellwig", "Jens Axboe", "Ming Lei", "Keith Busch"
Sent: Saturday, May 19, 2018 1:42:47 AM
Subject: [PATCH blktests] Fix block/011 to not use sysfs for device disabling

The PCI sysfs interface may not be a dependable method for toggling the
PCI device state to trigger the timeouts. This patch goes directly to
the config space to make device failure occur.

The success of this test is still sensitive to timing, as it may disable
IO memory when a driver is trying to bring it online. This can look like
a permanent device failure from the driver's perspective.

Signed-off-by: Keith Busch
---
 tests/block/011 | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tests/block/011 b/tests/block/011
index 62e89f7..2fc0ffb 100755
--- a/tests/block/011
+++ b/tests/block/011
@@ -21,7 +21,7 @@ DESCRIPTION="disable PCI device while doing I/O"
 TIMED=1
 
 requires() {
-	_have_fio
+	_have_fio && _have_program setpci
 }
 
 device_requires() {
@@ -43,10 +43,11 @@ test_device() {
 	_run_fio_rand_io --filename="$TEST_DEV" --size="$size" \
 		--ignore_error=EIO,ENXIO,ENODEV &
 
+	# toggle PCI Command Register's Memory and Bus Master enabling
 	while kill -0 $! 2>/dev/null; do
-		echo 0 > "/sys/bus/pci/devices/${pdev}/enable"
+		setpci -s "${pdev}" 4.w=0:6
 		sleep .2
-		echo 1 > "/sys/bus/pci/devices/${pdev}/enable"
+		setpci -s "${pdev}" 4.w=6:6
 		sleep .2
 	done
-- 
2.14.3
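For anyone decoding the setpci lines: offset 4 in PCI config space is
the 16-bit Command Register, `.w` selects a word access, and
`=VALUE:MASK` writes only the bits set in MASK. Mask 6 covers bit 1
(Memory Space Enable) and bit 2 (Bus Master Enable), so `4.w=0:6` cuts
off MMIO and DMA while `4.w=6:6` restores them, leaving all other
Command Register bits untouched. A small illustration (the device
address below is an example, not the reporter's device):

        # Example only: poke the Command Register of a device.
        pdev=0000:85:00.0

        setpci -s "$pdev" COMMAND     # read the current Command Register
        setpci -s "$pdev" 4.w=0:6     # clear Memory Space + Bus Master bits
        setpci -s "$pdev" 4.w=6:6     # set them again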
Re: [PATCH v2 14/26] ibtrs: include client and server modules into kernel compilation
Hi Roman,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc6 next-20180517]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Roman-Pen/InfiniBand-Transport-IBTRS-and-Network-Block-Device-IBNBD/20180520-222445
config: m68k-allyesconfig (attached as .config)
compiler: m68k-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k

All errors (new ones prefixed by >>):

   drivers/mtd/nand/raw/nand_base.o: In function `nand_soft_waitrdy':
   nand_base.c:(.text+0x1022): undefined reference to `__udivdi3'
   drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.o: In function `ibtrs_clt_stats_wc_completion_to_str':
>> ibtrs-clt-stats.c:(.text+0x172): undefined reference to `__udivdi3'
   drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.o: In function `ibtrs_clt_stats_sg_list_distr_to_str':
   ibtrs-clt-stats.c:(.text+0x49c): undefined reference to `__udivdi3'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: application/gzip
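`__udivdi3` is the libgcc helper for 64-bit division, which the kernel
deliberately does not link on 32-bit targets such as m68k; any plain `/`
applied to a u64 in the ibtrs stats code will trip this. The
conventional fix is the div64 helpers from <linux/math64.h>. A sketch of
the pattern (the function and field names here are hypothetical,
mirroring the *_to_str helpers flagged above):

        /* Sketch: avoid __udivdi3 by using the kernel's div64 helpers. */
        #include <linux/math64.h>

        static u64 avg_wc_completion_ns(u64 total_ns, u32 nr_samples)
        {
                if (!nr_samples)
                        return 0;
                /*
                 * return total_ns / nr_samples;
                 * would emit a __udivdi3 call on 32-bit architectures.
                 */
                return div_u64(total_ns, nr_samples);  /* u64 / u32 */
        }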