Re: [PATCH for-6.2/block V3 1/2] block: Data type conversion for IO accounting
On 12/21/22 06:05, Gulam Mohamed wrote:

> Change the data type of the start- and end-time IO accounting variables in the block layer from "unsigned long" to "u64". This is to enable nanosecond granularity, in the next commit, for devices whose latency is below a millisecond.
>
> Changes from V2 to V3:
> 1. Changed all the required variables' data type to u64 as part of this first patch
> 2. Create a new setting '2' for iostats in sysfs in the next patch
> 3. Change the code to get the ktime values when iostats=2 in the next patch
>
> Signed-off-by: Gulam Mohamed
> ---
>  block/blk-core.c              | 24
>  block/blk.h                   |  2 +-
>  drivers/block/drbd/drbd_int.h |  2 +-
>  drivers/block/zram/zram_drv.c |  4 ++--
>  drivers/md/bcache/request.c   | 10 +-
>  drivers/md/dm-core.h          |  2 +-
>  drivers/md/dm.c               |  2 +-
>  drivers/md/md.h               |  2 +-
>  drivers/md/raid1.h            |  2 +-
>  drivers/md/raid10.h           |  2 +-
>  drivers/md/raid5.c            |  2 +-
>  drivers/nvdimm/btt.c          |  2 +-
>  drivers/nvdimm/pmem.c         |  2 +-
>  include/linux/blk_types.h     |  2 +-
>  include/linux/blkdev.h        | 12 ++--
>  include/linux/part_stat.h     |  2 +-

nvme-mpath now also has stats, so struct nvme_request should also be updated.
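For readers wondering why the wider type matters: a jiffies-based unsigned long stamp (millisecond granularity at HZ=1000) simply cannot resolve the latency of a fast NVMe device. A minimal userspace illustration of the granularity argument; this is not the patch itself, just a sketch of the rationale:

--
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main(void)
{
	uint64_t start = now_ns();
	/* ...submit and complete one I/O here... */
	uint64_t end = now_ns();

	/*
	 * With jiffies in an unsigned long at HZ=1000, any latency under
	 * one millisecond rounds down to zero ticks; a u64 nanosecond
	 * stamp still resolves it.
	 */
	printf("latency: %llu ns\n", (unsigned long long)(end - start));
	return 0;
}
--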
Re: [dm-devel] [PATCH 09/11] nvme: remove a spurious clear of discard_alignment
Reviewed-by: Sagi Grimberg
Re: [dm-devel] [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
Well, when it fails over, it will probably be directed to the poll queues. Maybe I'm missing something...

In this patchset, because it isn't submitted directly from the FS, there isn't a polling context associated with this bio, so its HIPRI flag will be cleared and it falls back to IRQ mode. I think that's fine for failover I/O...
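A hedged sketch of that fallback, for illustration only; the per-task ->io_poll_ctx field is a hypothetical stand-in for whatever the RFC actually checks:

--
/*
 * Sketch only: without a polling context on the submitting task, strip
 * REQ_HIPRI so the bio completes via IRQ. ->io_poll_ctx is a
 * hypothetical field name, not the RFC's.
 */
static void bio_poll_prepare(struct bio *bio)
{
	if (!current->io_poll_ctx)
		bio->bi_opf &= ~REQ_HIPRI;	/* fall back to IRQ mode */
}
--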
Re: [dm-devel] [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
+static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
+{
+	bio->bi_iter.bi_private_data = cookie;
+}
+

Hey Ming, thinking about nvme-mpath, I'm thinking that this should be an exported function for failover. nvme-mpath updates bio.bi_dev when re-submitting I/Os to an alternate path, so I'm thinking that if this function is exported then nvme-mpath could do as little as the below to allow polling?

--
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 92adebfaf86f..e562e296153b 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -345,6 +345,7 @@ static void nvme_requeue_work(struct work_struct *work)
 	struct nvme_ns_head *head =
 		container_of(work, struct nvme_ns_head, requeue_work);
 	struct bio *bio, *next;
+	blk_qc_t cookie;

 	spin_lock_irq(&head->requeue_lock);
 	next = bio_list_get(&head->requeue_list);
@@ -359,7 +360,8 @@ static void nvme_requeue_work(struct work_struct *work)
 		 * path.
 		 */
 		bio_set_dev(bio, head->disk->part0);
-		submit_bio_noacct(bio);
+		cookie = submit_bio_noacct(bio);
+		blk_bio_poll_post_submit(bio, cookie);
 	}
 }
--

I/O failover will create misalignment between the polling-context CPU and the submission CPU (running requeue_work), but I don't see anything that would break here... I understand requeue shouldn't be a usual event, and I guess it is just fine to fall back to IRQ-based mode?

Well, when it will failover, it will probably be directed to the poll queues. Maybe I'm missing something...

This patchset actually doesn't cover such bio submission from kernel context.

What is the difference?
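If exporting is the route taken, the block-side change could be as small as dropping the static and adding an export; a sketch, assuming the RFC's helper stays exactly as quoted above:

--
void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
{
	/* Stamp the submission cookie so a later blk_poll() can find it. */
	bio->bi_iter.bi_private_data = cookie;
}
EXPORT_SYMBOL_GPL(blk_bio_poll_post_submit);
--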
Re: [dm-devel] [PATCH 03/24] nvme: let set_capacity_revalidate_and_notify update the bdev size
[ .. ] Originally nvme multipath would update/change the size of the multipath device according to the underlying path devices. With this patch the size of the multipath device will _not_ change if there is a change on the underlying devices.

Yes, it will. Take a close look at nvme_update_disk_info and how it is called.

Okay, then: what would be the correct way of handling a size update for NVMe multipath? Assuming we're getting an AEN for each path signalling the size change (or a controller reset leading to a size change), if we're updating the size of the multipath device together with the path device at the first AEN/reset, we'll end up with the other paths having a different size than the multipath device (and the path we've just been updating).

- Do we care, or do we cross fingers and hope for the best?
- Shouldn't we detect the case where we won't get a size update for the other paths, or, indeed, where we have a genuine device size mismatch due to a misconfiguration on the target? I.e. shouldn't we have a 'size update pending' flag for the other paths, to take them out of use temporarily until the other AENs/resets have been processed?

The mpath device will take the minimum size from all the paths; that is what blk_stack_limits does. When the AENs for all the paths arrive, the mpath size will update. Not sure how this is different from what we have today...
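A hedged sketch of the "minimum across paths" behaviour described in that last answer; nvme_mpath_min_sectors() is a hypothetical helper for illustration, not an existing function, and real code would also need proper locking around the sibling list:

--
static sector_t nvme_mpath_min_sectors(struct nvme_ns_head *head)
{
	struct nvme_ns *ns;
	sector_t min = (sector_t)-1;

	/* The head can never be larger than its smallest path. */
	list_for_each_entry(ns, &head->list, siblings)
		min = min_t(sector_t, min, get_capacity(ns->disk));

	return min == (sector_t)-1 ? 0 : min;
}
--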
Re: [dm-devel] [PATCH 3/3] nvme: don't call revalidate_disk from nvme_set_queue_dying
Reviewed-by: Sagi Grimberg
Re: [dm-devel] [PATCH 1/3] block: replace bd_set_size with bd_set_nr_sectors
Reviewed-by: Sagi Grimberg
Re: [dm-devel] [PATCH 2/3] block: fix locking for struct block_device size updates
Reviewed-by: Sagi Grimberg
Re: [dm-devel] [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate
+	switch (nvme_req_disposition(req)) {
+	case COMPLETE:
+		nvme_complete_req(req);

nvme_complete_rq calling nvme_complete_req... Maybe call it __nvme_complete_rq instead?

That's what I had first, but it felt so strangely out of place next to the other nvme_*_req calls.. Maybe nvme_complete_rq needs to be renamed - what about nvme_req_done?

I'd suggest calling nvme_complete_rq nvme_end_rq, because it really calls blk_mq_end_request.

That would be confusing next to nvme_end_request, which calls blk_mq_complete_request... Maybe we need some naming improvements throughout.

Maybe call nvme_req_disposition again locally here to not carry the is_ana_status. But not a biggy..

That means it would have to become non-static in scope, limiting inlining possibilities, etc.

I see.
Re: [dm-devel] [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate
+static inline enum nvme_disposition nvme_req_disposition(struct request *req)
+{
+	if (likely(nvme_req(req)->status == 0))
+		return COMPLETE;
+
+	if (blk_noretry_request(req) ||
+	    (nvme_req(req)->status & NVME_SC_DNR) ||
+	    nvme_req(req)->retries >= nvme_max_retries)
+		return COMPLETE;
+
+	if (req->cmd_flags & REQ_NVME_MPATH) {
+		switch (nvme_req(req)->status & 0x7ff) {
+		case NVME_SC_ANA_TRANSITION:
+		case NVME_SC_ANA_INACCESSIBLE:
+		case NVME_SC_ANA_PERSISTENT_LOSS:
+			return REDIRECT_ANA;
+		case NVME_SC_HOST_PATH_ERROR:
+		case NVME_SC_HOST_ABORTED_CMD:
+			return REDIRECT_TMP;
+		}
+	}
+
+	if (blk_queue_dying(req->q))
+		return COMPLETE;
+	return RETRY;
+}
+
+static inline void nvme_complete_req(struct request *req)
 {
 	blk_status_t status = nvme_error_status(nvme_req(req)->status);

-	trace_nvme_complete_rq(req);
+	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
+	    req_op(req) == REQ_OP_ZONE_APPEND)
+		req->__sector = nvme_lba_to_sect(req->q->queuedata,
+			le64_to_cpu(nvme_req(req)->result.u64));
+
+	nvme_trace_bio_complete(req, status);
+	blk_mq_end_request(req, status);
+}
+
+void nvme_complete_rq(struct request *req)
+{
+	trace_nvme_complete_rq(req);
 	nvme_cleanup_cmd(req);

 	if (nvme_req(req)->ctrl->kas)
 		nvme_req(req)->ctrl->comp_seen = true;

-	if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
-		if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req))
-			return;
-
-		if (!blk_queue_dying(req->q)) {
-			nvme_retry_req(req);
-			return;
-		}
-	} else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
-		   req_op(req) == REQ_OP_ZONE_APPEND) {
-		req->__sector = nvme_lba_to_sect(req->q->queuedata,
-			le64_to_cpu(nvme_req(req)->result.u64));
+	switch (nvme_req_disposition(req)) {
+	case COMPLETE:
+		nvme_complete_req(req);

nvme_complete_rq calling nvme_complete_req... Maybe call it __nvme_complete_rq instead?

+		return;
+	case RETRY:
+		nvme_retry_req(req);
+		return;
+	case REDIRECT_ANA:
+		nvme_failover_req(req, true);
+		return;
+	case REDIRECT_TMP:
+		nvme_failover_req(req, false);
+		return;
 	}
-
-	nvme_trace_bio_complete(req, status);
-	blk_mq_end_request(req, status);
 }
 EXPORT_SYMBOL_GPL(nvme_complete_rq);

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 3ded54d2c9c6ad..0c22b2c88687a2 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -65,51 +65,32 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
 	}
 }

-bool nvme_failover_req(struct request *req)
+void nvme_failover_req(struct request *req, bool is_ana_status)
 {
 	struct nvme_ns *ns = req->q->queuedata;
-	u16 status = nvme_req(req)->status;
 	unsigned long flags;

-	switch (status & 0x7ff) {
-	case NVME_SC_ANA_TRANSITION:
-	case NVME_SC_ANA_INACCESSIBLE:
-	case NVME_SC_ANA_PERSISTENT_LOSS:
-		/*
-		 * If we got back an ANA error we know the controller is alive,
-		 * but not ready to serve this namespace. The spec suggests
-		 * we should update our general state here, but due to the fact
-		 * that the admin and I/O queues are not serialized that is
-		 * fundamentally racy. So instead just clear the current path,
-		 * mark the path as pending and kick off a re-read of the ANA
-		 * log page ASAP.
-		 */
-		nvme_mpath_clear_current_path(ns);
-		if (ns->ctrl->ana_log_buf) {
-			set_bit(NVME_NS_ANA_PENDING, &ns->flags);
-			queue_work(nvme_wq, &ns->ctrl->ana_work);
-		}
-		break;
-	case NVME_SC_HOST_PATH_ERROR:
-	case NVME_SC_HOST_ABORTED_CMD:
-		/*
-		 * Temporary transport disruption in talking to the controller.
-		 * Try to send on a new path.
-		 */
-		nvme_mpath_clear_current_path(ns);
-		break;
-	default:
-		/* This was a non-ANA error so follow the normal error path. */
-		return false;
+	nvme_mpath_clear_current_path(ns);
+
+	/*
+	 * If we got back an ANA error we know the controller is alive, but not
+	 * ready to serve this namespace. The spec suggests we should update
+	 * our general state here, but due to the fact that the admin and I/O
+	 * queues are not serialized that is fundamentally racy. [ .. ]
Re: [dm-devel] nvme: restore use of blk_path_error() in nvme_complete_rq()
Hey Mike,

The point is: blk_path_error() has nothing to do with NVMe errors. This is dm-multipath logic stuck in the middle of the NVMe error handling code.

No, it is a means to have multiple subsystems (to this point both SCSI and NVMe) doing the correct thing when translating subsystem specific error codes to BLK_STS codes.

Not exactly; I don't find any use of this in SCSI. The purpose is to make sure that nvme and dm speak the same language.

SCSI doesn't need to do additional work to train a multipath layer because dm-multipath _is_ SCSI's multipathing in Linux.

Agree.

So ensuring SCSI properly classifies its error codes happens as a side-effect of ensuring continued multipath functionality. Hannes introduced all these differentiated error codes in block core because of SCSI. NVMe is meant to build on the infrastructure that was established.

Yes, exactly my point. blk_path_error is designed to make nvme and dm-multipath speak the same language.

If you, or others you name drop below, understood the point we wouldn't be having this conversation. You'd accept the point of blk_path_error() to be valid and required codification of what constitutes a retryable path error for the Linux block layer.

This incident is a case where the specific nvme status was designed to be retried on the same path, respecting the controller retry delay. And because nvme used blk_path_error at the time, it caused us to use a non-retryable status to get around that. Granted, no one had dm-multipath in mind. So in a sense, there is consensus on changing patch 35038bffa87da _because_ nvme no longer uses blk_path_error. Otherwise it would break.

"break" meaning it would do failover instead of the more optimal local retry to the same controller.

I see. Wish the header for commit 35038bffa87da touched on this even a little bit ;)

I think it did, but maybe didn't put too much emphasis on it.

But AFAICT the patch I provided doesn't compromise proper local retry -- as long as we first fix nvme_error_status() to return a retryable BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq. Think of blk_path_error() as a more coarse-grained "is this retryable or a hard failure?" check. So for NVME_SC_CMD_INTERRUPTED, nvme_error_status() _should_ respond with something retryable (I'd prefer BLK_STS_RESOURCE to be honest).

But blk_path_error semantically means "is this a pathing error", or at least that is what its name suggests.

And then nvme_failover_req() is finer-grained; by returning false it now allows short-circuiting failover and reverting back to NVMe's normal controller based error recovery -- which it does for NVME_SC_CMD_INTERRUPTED due to the "default" case in nvme_failover_req(). And then the previous nvme_error_status() classification of retryable BLK_STS obviously never gets returned up the IO stack; it gets thrown away.

I see what you are saying. The issue is that the code becomes convoluted (it's a pathing error; oh wait, no, it's not a pathing error).

Any BLK_STS mapping of NVMe specific error codes would need to not screw up by categorizing a retryable error as non-retryable (and vice-versa).

But it is a special type of retryable. There is nothing that fits the semantics of the current behavior.

I agree. But that's fine actually.

And this issue is the poster-child for why properly supporting a duality of driver-level vs upper level multipathing capabilities is pretty much impossible unless a clean design factors out the different error classes -- and local error retry is handled before punting to higher level failover retry. Think if NVMe were to adopt a bit more disciplined "local then failover" error handling it all gets easier.

I don't think punting before is easier, because we do a local retry if:
- there is no multipathing sw on top
- the request needs retry (e.g. no DNR, noretry is off, etc.)
- the nvme error is not pathing related (nvme_failover_req returned false)

This local retry _is_ NVMe specific. NVMe should just own retrying on the same controller no matter what (I'll hope that such retry has awareness to not retry indefinitely though!).

It will retry until the retry limit.

And this has nothing to do with multipathing, so the logic to handle it shouldn't be trapped in nvme_failover_req().

Well, given that nvme_failover_req already may not actually fail over, this makes some sense to me (although I did have some resistance to making it that way in the first place, but was convinced otherwise).

I think NVMe can easily fix this by having an earlier stage of checking, e.g. nvme_local_retry_req(), that short-circuits ever getting to higher-level multipathing consideration (be it native NVMe or DM multipathing) for cases like NVME_SC_CMD_INTERRUPTED. To be clear: the "default" case of nvme_failover_req() that returns false to fall back to NVMe's "local" normal NVMe error handling -- that can stay.. but a more explicit handling of cases like NVME_SC_CMD_INTERRUPTED should be added to [ .. ]
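A hedged sketch of the nvme_local_retry_req() idea floated above: catch statuses that must be retried on the same controller before any multipath (native or DM) failover logic is consulted. This is hypothetical, not merged code:

--
/*
 * Sketch only: short-circuit local retries ahead of failover.
 * nvme_retry_req() already honours the controller retry delay (CRD).
 */
static bool nvme_local_retry_req(struct request *req)
{
	switch (nvme_req(req)->status & 0x7ff) {
	case NVME_SC_CMD_INTERRUPTED:
		nvme_retry_req(req);	/* same controller, same path */
		return true;
	default:
		return false;
	}
}
--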
Re: [dm-devel] nvme: restore use of blk_path_error() in nvme_complete_rq()
Hey Mike,

The point is: blk_path_error() has nothing to do with NVMe errors. This is dm-multipath logic stuck in the middle of the NVMe error handling code.

No, it is a means to have multiple subsystems (to this point both SCSI and NVMe) doing the correct thing when translating subsystem specific error codes to BLK_STS codes.

Not exactly; I don't find any use of this in SCSI. The purpose is to make sure that nvme and dm speak the same language.

If you, or others you name drop below, understood the point we wouldn't be having this conversation. You'd accept the point of blk_path_error() to be valid and required codification of what constitutes a retryable path error for the Linux block layer.

This incident is a case where the specific nvme status was designed to be retried on the same path, respecting the controller retry delay. And because nvme used blk_path_error at the time, it caused us to use a non-retryable status to get around that. Granted, no one had dm-multipath in mind. So in a sense, there is consensus on changing patch 35038bffa87da _because_ nvme no longer uses blk_path_error. Otherwise it would break.

Any BLK_STS mapping of NVMe specific error codes would need to not screw up by categorizing a retryable error as non-retryable (and vice-versa).

But it is a special type of retryable. There is nothing that fits the semantics of the current behavior.

Again, assuming proper testing, commit 35038bffa87 wouldn't have made it upstream if NVMe still used blk_path_error().

Agree.

Yes, your commit 764e9332098c0 needlessly removed NVMe's use of blk_path_error(). If you're saying it wasn't needless, please explain why.

Fixes: 764e9332098c0 ("nvme-multipath: do not reset on unknown status")
Cc: sta...@vger.kernel.org
Signed-off-by: Mike Snitzer
---
 drivers/nvme/host/core.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6585d57112ad..072f629da4d8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -290,8 +290,13 @@ void nvme_complete_rq(struct request *req)
 		nvme_req(req)->ctrl->comp_seen = true;

 	if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
-		if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req))
-			return;
+		if (blk_path_error(status)) {
+			if (req->cmd_flags & REQ_NVME_MPATH) {
+				if (nvme_failover_req(req))
+					return;
+				/* fallthru to normal error handling */
+			}
+		}

This would basically undo the patch Hannes, Christoph, and I put together in commit 764e9332098c0. This patch greatly simplified and improved the whole nvme_complete_rq() code path, and I don't support undoing that change.

Please elaborate on how I've undone anything?

I think you're right, this hasn't undone the patch from John, but it breaks NVME_SC_CMD_INTERRUPTED error handling behavior.

The only thing I may have done is forced NVMe to take more care about properly translating its NVME_SC to BLK_STS in nvme_error_status(). Which is a good thing.

I don't think there is an issue here of mistakenly converting an nvme status code to a wrong block status code. This conversion is there because there is no block status that gives us the semantics we need, which is apparently specific to nvme.

I personally don't mind restoring blk_path_error to nvme; I don't particularly feel strongly about it either. But for sure blk_path_error needs to first provide the semantics needed for NVME_SC_CMD_INTERRUPTED.

...

Anyway, no new BLK_STS is needed at this point. More discipline with how NVMe's error handling is changed is.

Please read the above. If NVMe wants to ensure its interface isn't broken regularly it _should_ use blk_path_error() to validate future nvme_error_status() changes. Miscategorizing NVMe errors to upper layers is a bug -- not open for debate.

Again, I don't agree it is a miscategorization nor a bug; it's just something that is NVMe specific.
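For reference, since the whole argument turns on it: blk_path_error() in include/linux/blk_types.h is a blacklist check (quoted from memory, so verify against the tree) -- a short list of statuses that are definitely not path failures, with everything else treated as possibly path-related and therefore retryable on another path:

--
static inline bool blk_path_error(blk_status_t error)
{
	switch (error) {
	case BLK_STS_NOTSUPP:
	case BLK_STS_NOSPC:
	case BLK_STS_TARGET:
	case BLK_STS_NEXUS:
	case BLK_STS_MEDIUM:
	case BLK_STS_PROTECTION:
		return false;
	}

	/* Anything else could be a path failure, so should be retried */
	return true;
}
--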
Re: [dm-devel] [PATCH V14 00/18] block: support multi-page bvec
V14:
- drop patch (patch 4 in V13) for renaming bvec helpers, as suggested by Jens
- use mp_bvec_* as the multi-page bvec helper name
- fix one build issue, which is caused by a missing conversion of bio_for_each_segment_all in fs/gfs2
- fix one 32-bit ARCH specific issue caused by segment boundary mask overflow

Hey Ming,

So is nvme-tcp also affected here? The only point where I see nvme-tcp can be affected is when initializing a bvec iter using bio_segments(), as everywhere else we use iters which should transparently work... I see that loop was converted; does it mean that nvme-tcp needs to call something like?

--
bio_for_each_mp_bvec(bv, bio, iter)
	nr_bvecs++;
--
Re: [dm-devel] [PATCH V10 09/19] block: introduce bio_bvecs()
Wait, I see that the bvec is still a single array per bio. When you said a table I thought you meant a 2-dimensional array...

I mean a new 1-d table A has to be created for the multiple bios in one rq, and built in the following way:

--
rq_for_each_bvec(tmp, rq, rq_iter)
	*A++ = tmp;
--

Then you can pass A to iov_iter_bvec() & send(). Given it is over TCP, I guess it should be doable for you to preallocate one 256-bvec table in one page for each request, then set the max segment size as (unsigned int)-1 and the max segment number as 256; the preallocated table should work anytime.

A 256-bvec table is really a lot to preallocate, especially when it's not needed; I can easily initialize the bvec_iter on the bio bvec. If this involves preallocation of the worst case then I don't consider this to be an improvement.
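Spelled out a bit more, the suggestion amounts to something like the following sketch; hedged: the 256 sizing comes from the discussion above, bounds checking and error handling are elided, and this is not driver code:

--
struct bio_vec tbl[256], tmp, *p = tbl;	/* preallocated, one page */
struct req_iterator rq_iter;
struct iov_iter iter;

/* Flatten every bvec of every bio in the request into one table. */
rq_for_each_bvec(tmp, rq, rq_iter)
	*p++ = tmp;		/* assumes the rq has at most 256 bvecs */

/* One iterator over the whole request, ready for a single send(). */
iov_iter_bvec(&iter, WRITE, tbl, p - tbl, blk_rq_bytes(rq));
--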
Re: [dm-devel] [PATCH V10 09/19] block: introduce bio_bvecs()
Yeah, that is the most common example, given merge is enabled in most cases. If the driver or device doesn't care about merging, you can disable it and always get single-bio requests, then the bio's bvec table can be reused for send().

Does bvec_iter span bvecs with your patches? I didn't see that change?

Wait, I see that the bvec is still a single array per bio. When you said a table I thought you meant a 2-dimensional array...

Unless I'm mistaken, I think that the change is pretty simple then. However, nvme-tcp still needs to be bio aware unless we have some abstraction in place, which will mean that nvme-tcp will need to open-code bio_bvecs.
Re: [dm-devel] [PATCH V10 09/19] block: introduce bio_bvecs()
I would like to avoid growing bvec tables and keep everything preallocated. Plus, a bvec_iter operates on a bvec, which means we'll need a table there as well... Not liking it so far...

In the case of multiple bios in one request, we can't know how many bvecs there are except by calling rq_bvecs(), so it may not be suitable to preallocate the table. If you have to send the IO request in one send(), runtime allocation may be inevitable.

I don't want to do that; I want to work on a single bvec at a time like the current implementation does.

If you don't require sending the IO request in one send(), you may send one bio at a time and just use the bio's bvec table directly, such as the single-bio case in lo_rw_aio().

We'd need some indication that we need to reinit my iter with the new bvec; today we do:

--
static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
		int len)
{
	req->snd.data_sent += len;
	req->pdu_sent += len;
	iov_iter_advance(&req->snd.iter, len);
	if (!iov_iter_count(&req->snd.iter) &&
	    req->snd.data_sent < req->data_len) {
		req->snd.curr_bio = req->snd.curr_bio->bi_next;
		nvme_tcp_init_send_iter(req);
	}
}
--

and initialize the send iter. I imagine that now I will need to switch to the next bvec, and only if I'm on the last one do I need to use the next bio... Do you offer an API for that?

Can this way avoid your blocking issue? You may see this example in the 'rq->bio != rq->biotail' branch of lo_rw_aio().

This is exactly an example of not ignoring the bios...

Yeah, that is the most common example, given merge is enabled in most cases. If the driver or device doesn't care about merging, you can disable it and always get single-bio requests, then the bio's bvec table can be reused for send().

Does bvec_iter span bvecs with your patches? I didn't see that change? I'm not sure how this helps me either. Unless we can set a bvec_iter to span bvecs or have an abstract bio crossing when we re-initialize the bvec_iter, I don't see how I can ignore bios completely...

rq_for_each_bvec() will iterate over all bvecs from all bios, so you needn't see any bio in this req.

But I don't need this iteration; I need a transparent API like:

	bvec2 = rq_bvec_next(rq, bvec)

This way I can simply always reinit my iter without thinking about how the request/bios/bvecs are constructed...

rq_bvecs() will return how many bvecs there are in this request (covering all bios in this req).

Still not very useful given that I don't want to use a table...

So it looks like the nvme-tcp host driver might be the 2nd driver which benefits from multi-page bvec directly. The multi-page bvec V11 has passed my tests and addressed almost all the comments during review on V10. I removed bio_vecs() in V11, but it won't be a big deal; we can introduce them anytime when there is the requirement.

multipage-bvecs and nvme-tcp are going to conflict, so it would be good to coordinate on this. I think that nvme-tcp host needs some adjustments such as setting a bvec_iter. I'm under the impression that the change is rather small and self-contained, but I'm not sure I have the full picture here.

I guess I may not get your exact requirement on block io iterator from nvme-tcp too, :-(

They are pretty much listed above. Today nvme-tcp sets an iterator with:

--
vec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
nsegs = bio_segments(bio);
size = bio->bi_iter.bi_size;
offset = bio->bi_iter.bi_bvec_done;
iov_iter_bvec(&req->snd.iter, WRITE, vec, nsegs, size);
--

and when done, iterates to the next bio and does the same.

With multipage bvec it would be great if we could simply have something like rq_bvec_next() that would pretty much satisfy the requirements from the nvme-tcp side...
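For concreteness, a hedged sketch of what such an rq_bvec_next()-style helper might look like; it is entirely hypothetical (no such helper exists in the tree) and assumes non-cloned bios where bi_vcnt/bi_io_vec are valid:

--
static struct bio_vec *rq_bvec_next(struct bio **bio, unsigned int *idx)
{
	/* More bvecs left in the current bio? */
	if (++(*idx) < (*bio)->bi_vcnt)
		return &(*bio)->bi_io_vec[*idx];

	/* Cross transparently into the next bio of the request. */
	*bio = (*bio)->bi_next;
	*idx = 0;
	return *bio ? &(*bio)->bi_io_vec[0] : NULL;
}
--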
Re: [dm-devel] [PATCH V10 09/19] block: introduce bio_bvecs()
Not sure I understand the 'blocking' problem in this case. We can build a bvec table from this req, and send them all in send().

I would like to avoid growing bvec tables and keep everything preallocated. Plus, a bvec_iter operates on a bvec, which means we'll need a table there as well... Not liking it so far...

Can this way avoid your blocking issue? You may see this example in the 'rq->bio != rq->biotail' branch of lo_rw_aio().

This is exactly an example of not ignoring the bios...

If this way is what you need, I think you are right; we may even introduce the following helpers:

	rq_for_each_bvec()
	rq_bvecs()

I'm not sure how this helps me either. Unless we can set a bvec_iter to span bvecs or have an abstract bio crossing when we re-initialize the bvec_iter, I don't see how I can ignore bios completely...

So it looks like the nvme-tcp host driver might be the 2nd driver which benefits from multi-page bvec directly. The multi-page bvec V11 has passed my tests and addressed almost all the comments during review on V10. I removed bio_vecs() in V11, but it won't be a big deal; we can introduce them anytime when there is the requirement.

multipage-bvecs and nvme-tcp are going to conflict, so it would be good to coordinate on this. I think that nvme-tcp host needs some adjustments such as setting a bvec_iter. I'm under the impression that the change is rather small and self-contained, but I'm not sure I have the full picture here.
Re: [dm-devel] [PATCH V10 09/19] block: introduce bio_bvecs()
The only user in your final tree seems to be the loop driver, and even that one only uses the helper for read/write bios. I think something like this would be much simpler in the end:

The recently submitted nvme-tcp host driver should also be a user of this. Does it make sense to keep it as a helper then?

I did take a brief look at the code, and I really don't understand why the heck it even deals with bios to start with. Like all the other nvme transports it is a blk-mq driver and should iterate over segments in a request and more or less ignore bios. Something is horribly wrong in the design.

Can you explain a little more? I'm more than happy to change that but I'm not completely clear how...

Before we begin a data transfer, we need to set our own iterator that will advance with the progression of the data transfer. We also need to keep in mind that all the data transfers (both send and recv) are completely non-blocking (and zero-copy when we send). That means that every data movement needs to be able to suspend and resume asynchronously, i.e. we cannot use the following pattern:

--
rq_for_each_segment(bvec, rq, rq_iter) {
	iov_iter_bvec(&iov_iter, WRITE, &bvec, 1, bvec.bv_len);
	send(sock, iov_iter);
}
--

Given that a request can hold more than a single bio, I'm not clear on how we can achieve that without iterating over the bios in the request ourselves. Any useful insight?
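What the non-blocking constraint forces instead is a resumable iterator kept with the request; a hedged sketch of the shape (the function name is hypothetical and error handling is elided):

--
static int try_send(struct socket *sock, struct iov_iter *iter)
{
	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
	int ret;

	msg.msg_iter = *iter;
	ret = sock_sendmsg(sock, &msg);		/* may accept only part */
	if (ret > 0)
		iov_iter_advance(iter, ret);	/* resume from here later */
	return ret;	/* on -EAGAIN the caller re-arms and retries */
}
--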
Re: [dm-devel] [PATCH V10 09/19] block: introduce bio_bvecs()
The only user in your final tree seems to be the loop driver, and even that one only uses the helper for read/write bios. I think something like this would be much simpler in the end:

The recently submitted nvme-tcp host driver should also be a user of this. Does it make sense to keep it as a helper then?
Re: [dm-devel] [for-4.16 PATCH 4/5] dm mpath: use NVMe error handling to know when an error is retryable
But interestingly, with my "mptest" link failure test (test_01_nvme_offline) I'm not actually seeing NVMe trigger a failure that needs a multipath layer (be it NVMe multipath or DM multipath) to fail a path and retry the IO. The pattern is that the link goes down, nvme waits for it to come back (internalizing any failure), and then the IO continues... so no multipath is _really_ needed:

[55284.011286] nvme nvme0: NVME-FC{0}: controller connectivity lost. Awaiting Reconnect
[55284.020078] nvme nvme1: NVME-FC{1}: controller connectivity lost. Awaiting Reconnect
[55284.028872] nvme nvme2: NVME-FC{2}: controller connectivity lost. Awaiting Reconnect
[55284.037658] nvme nvme3: NVME-FC{3}: controller connectivity lost. Awaiting Reconnect
[55295.157773] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[55295.157775] nvmet: ctrl 4 keep-alive timer (15 seconds) expired!
[55295.157778] nvmet: ctrl 3 keep-alive timer (15 seconds) expired!
[55295.157780] nvmet: ctrl 2 keep-alive timer (15 seconds) expired!
[55295.157781] nvmet: ctrl 4 fatal error occurred!
[55295.157784] nvmet: ctrl 3 fatal error occurred!
[55295.157785] nvmet: ctrl 2 fatal error occurred!
[55295.199816] nvmet: ctrl 1 fatal error occurred!
[55304.047540] nvme nvme0: NVME-FC{0}: connectivity re-established. Attempting reconnect
[55304.056533] nvme nvme1: NVME-FC{1}: connectivity re-established. Attempting reconnect
[55304.066053] nvme nvme2: NVME-FC{2}: connectivity re-established. Attempting reconnect
[55304.075037] nvme nvme3: NVME-FC{3}: connectivity re-established. Attempting reconnect
[55304.373776] nvmet: creating controller 1 for subsystem mptestnqn for NQN nqn.2014-08.org.nvmexpress:uuid:----.
[55304.373835] nvmet: creating controller 2 for subsystem mptestnqn for NQN nqn.2014-08.org.nvmexpress:uuid:----.
[55304.373873] nvmet: creating controller 3 for subsystem mptestnqn for NQN nqn.2014-08.org.nvmexpress:uuid:----.
[55304.373879] nvmet: creating controller 4 for subsystem mptestnqn for NQN nqn.2014-08.org.nvmexpress:uuid:----.
[55304.430988] nvme nvme0: NVME-FC{0}: controller reconnect complete
[55304.433124] nvme nvme3: NVME-FC{3}: controller reconnect complete
[55304.433705] nvme nvme1: NVME-FC{1}: controller reconnect complete

It seems if we have multipath on top (again: either NVMe native multipath _or_ DM multipath) we'd prefer to have the equivalent of SCSI's REQ_FAILFAST_TRANSPORT support? But nvme_req_needs_retry() calls blk_noretry_request(), which returns true if REQ_FAILFAST_TRANSPORT is set. Which results in nvme_req_needs_retry() returning false. Which causes nvme_complete_rq() to skip the multipath specific nvme_req_needs_failover(), etc.

So all said:

1) why wait for connection recovery if we have other connections to try? I think NVMe needs to be plumbed to respect REQ_FAILFAST_TRANSPORT.

This is specific to the FC fail-fast logic; nvme-rdma will fail inflight commands as soon as the transport sees an error (or the keep-alive timeout expires). It seems that FC wants to wait for the request retries counter to be exceeded, but given that the queue isn't unquiesced, the requests are quiesced until the host successfully reconnects.
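For reference, blk_noretry_request() at the time was a simple mask over the failfast flags (quoted from memory from include/linux/blkdev.h; double-check against the tree), which is why a REQ_FAILFAST_TRANSPORT request short-circuits nvme_req_needs_retry():

--
#define blk_noretry_request(rq) \
	((rq)->cmd_flags & (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | \
			    REQ_FAILFAST_DRIVER))
--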
Re: [dm-devel] hch's native NVMe multipathing [was: Re: [PATCH 1/2] Don't blacklist nvme]
I'm fine with the path selectors getting moved out; maybe it'll encourage new path selectors to be developed.

But there will need to be some userspace interface stood up to support your native NVMe multipathing (you may not think one is needed, but I think in time there will be a need to configure _something_). That is the fragmentation I'm referring to.

I guess one config option that we'd need is multibus vs. failover, which is chosen per use-case. I'd have to say that having something much simpler than multipath-tools does sound appealing.
Re: [dm-devel] NVMeoF multi-path setup
On 01/07/16 01:52, Mike Snitzer wrote:

On Thu, Jun 30 2016 at 5:57pm -0400, Ming Lin wrote:

On Thu, 2016-06-30 at 14:08 -0700, Ming Lin wrote:

Hi Mike,

I'm trying to test NVMeoF multi-path.

root@host:~# lsmod | grep dm_multipath
dm_multipath           24576  0
root@host:~# ps aux | grep multipath
root     13183  0.0  0.1 238452  4972 ?  SLl  13:41  0:00 /sbin/multipathd

I have nvme0 and nvme1 that are 2 paths to the same NVMe subsystem.

root@host:/sys/class/nvme# grep . nvme*/address
nvme0/address:traddr=192.168.3.2,trsvcid=1023
nvme1/address:traddr=192.168.2.2,trsvcid=1023
root@host:/sys/class/nvme# grep . nvme*/subsysnqn
nvme0/subsysnqn:nqn.testiqn
nvme1/subsysnqn:nqn.testiqn

root@host:~# /lib/udev/scsi_id --export --whitelisted -d /dev/nvme1n1
ID_SCSI=1
ID_VENDOR=NVMe
ID_VENDOR_ENC=NVMe\x20\x20\x20\x20
ID_MODEL=Linux
ID_MODEL_ENC=Linux
ID_REVISION=0-rc
ID_TYPE=disk
ID_SERIAL=SNVMe_Linux
ID_SERIAL_SHORT=
ID_SCSI_SERIAL=1122334455667788

root@host:~# /lib/udev/scsi_id --export --whitelisted -d /dev/nvme0n1
ID_SCSI=1
ID_VENDOR=NVMe
ID_VENDOR_ENC=NVMe\x20\x20\x20\x20
ID_MODEL=Linux
ID_MODEL_ENC=Linux
ID_REVISION=0-rc
ID_TYPE=disk
ID_SERIAL=SNVMe_Linux
ID_SERIAL_SHORT=
ID_SCSI_SERIAL=1122334455667788

But it seems multipathd didn't recognize these 2 devices. What else am I missing?

There are two problems:

1. there is no "/block/" in the path /sys/devices/virtual/nvme-fabrics/block/nvme0/nvme0n1

You clarified that it is: /sys/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n1

Do you have CONFIG_BLK_DEV_NVME_SCSI enabled?

Indeed, for dm-multipath we need CONFIG_BLK_DEV_NVME_SCSI on. Another thing I noticed was that for nvme we need to manually set the timeout value because nvme devices don't expose a device/timeout sysfs file. This causes dm-multipath to take a 200-second default (not a huge problem because we have keep-alive in fabrics too).
Re: [dm-devel] [RFC PATCH] dm: fix excessive dm-mq context switching
The perf report is very similar to the one that started this effort... I'm afraid we'll need to resolve the per-target m->lock in order to scale with NUMA...

Could be. Just for testing, you can try the 2 topmost commits I've put here (once applied both __multipath_map and multipath_busy won't have _any_ locking... again, very much test-only):
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2

Hi Mike,

So I still don't see the IOPs scale like I expected. With these two patches applied I see ~670K IOPs while the perf output is different and does not indicate a clear lock contention.

--
-   4.67%  fio  [kernel.kallsyms]  [k] blk_account_io_start
   - blk_account_io_start
      - 56.05% blk_insert_cloned_request
           map_request
           dm_mq_queue_rq
           __blk_mq_run_hw_queue
           blk_mq_run_hw_queue
           blk_mq_insert_requests
           blk_mq_flush_plug_list
           blk_flush_plug_list
           blk_finish_plug
           do_io_submit
           SyS_io_submit
           entry_SYSCALL_64_fastpath
         + io_submit
      - 43.94% blk_mq_bio_to_request
           blk_mq_make_request
           generic_make_request
           submit_bio
           do_blockdev_direct_IO
           __blockdev_direct_IO
           blkdev_direct_IO
           generic_file_read_iter
           blkdev_read_iter
           aio_run_iocb
           io_submit_one
           do_io_submit
           SyS_io_submit
           entry_SYSCALL_64_fastpath
         + io_submit
-   2.52%  fio  [dm_mod]  [k] dm_mq_queue_rq
   - dm_mq_queue_rq
      - 99.16% __blk_mq_run_hw_queue
           blk_mq_run_hw_queue
           blk_mq_insert_requests
           blk_mq_flush_plug_list
           blk_flush_plug_list
           blk_finish_plug
           do_io_submit
           SyS_io_submit
           entry_SYSCALL_64_fastpath
         + io_submit
      + 0.84% blk_mq_run_hw_queue
-   2.46%  fio  [kernel.kallsyms]  [k] blk_mq_hctx_mark_pending
   - blk_mq_hctx_mark_pending
      - 99.79% blk_mq_insert_requests
           blk_mq_flush_plug_list
           blk_flush_plug_list
           blk_finish_plug
           do_io_submit
           SyS_io_submit
           entry_SYSCALL_64_fastpath
         + io_submit
-   2.07%  ksoftirqd/6  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
   - blk_mq_run_hw_queues
      - 99.70% rq_completed
           dm_done
           dm_softirq_done
           blk_done_softirq
         + __do_softirq
+   2.06%  ksoftirqd/0   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   2.02%  ksoftirqd/9   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   2.00%  ksoftirqd/20  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   2.00%  ksoftirqd/12  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.99%  ksoftirqd/11  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.97%  ksoftirqd/18  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.96%  ksoftirqd/1   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.95%  ksoftirqd/14  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.95%  ksoftirqd/13  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.94%  ksoftirqd/5   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.94%  ksoftirqd/8   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.93%  ksoftirqd/2   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.92%  ksoftirqd/21  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.92%  ksoftirqd/17  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.92%  ksoftirqd/7   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.91%  ksoftirqd/23  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.84%  ksoftirqd/4   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.81%  ksoftirqd/19  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.76%  ksoftirqd/3   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.76%  ksoftirqd/16  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.75%  ksoftirqd/15  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.74%  ksoftirqd/22  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.72%  ksoftirqd/10  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.38%  perf  [kernel.kallsyms]  [k] copy_user_generic_string
+   1.20%  fio   [kernel.kallsyms]  [k] enqueue_task_fair
+   1.18%  fio   [kernel.kallsyms]  [k] [ .. ]
--
Re: [dm-devel] [RFC PATCH] dm: fix excessive dm-mq context switching
Hello Sagi,

Hey Bart,

Did you run your test on a NUMA system?

I did.

If so, can you check with e.g. perf record -ags -e LLC-load-misses sleep 10 && perf report whether this workload triggers perhaps lock contention? What you need to look for in the perf output is whether any functions occupy more than 10% CPU time.

I will, thanks for the tip!
Re: [dm-devel] [RFC PATCH] dm: fix excessive dm-mq context switching
If so, can you check with e.g. perf record -ags -e LLC-load-misses sleep 10 && perf report whether this workload triggers perhaps lock contention? What you need to look for in the perf output is whether any functions occupy more than 10% CPU time.

I will, thanks for the tip!

The perf report is very similar to the one that started this effort... I'm afraid we'll need to resolve the per-target m->lock in order to scale with NUMA...

--
-  17.33%  fio  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - queued_spin_lock_slowpath
      - 52.09% _raw_spin_lock_irq
           __multipath_map
           multipath_clone_and_map
           map_request
           dm_mq_queue_rq
           __blk_mq_run_hw_queue
           blk_mq_run_hw_queue
           blk_mq_insert_requests
           blk_mq_flush_plug_list
           blk_flush_plug_list
           blk_finish_plug
           do_io_submit
           SyS_io_submit
           entry_SYSCALL_64_fastpath
         + io_submit
      - 46.87% _raw_spin_lock_irqsave
         - 99.97% multipath_busy
              dm_mq_queue_rq
              __blk_mq_run_hw_queue
              blk_mq_run_hw_queue
              blk_mq_insert_requests
              blk_mq_flush_plug_list
              blk_flush_plug_list
              blk_finish_plug
              do_io_submit
              SyS_io_submit
              entry_SYSCALL_64_fastpath
            + io_submit
+   4.99%  fio  [kernel.kallsyms]  [k] blk_account_io_start
+   3.93%  fio  [dm_multipath]     [k] __multipath_map
+   2.64%  fio  [dm_multipath]     [k] multipath_busy
+   2.38%  fio  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
+   2.31%  fio  [dm_mod]           [k] dm_mq_queue_rq
+   2.25%  fio  [kernel.kallsyms]  [k] blk_mq_hctx_mark_pending
+   1.81%  fio  [kernel.kallsyms]  [k] blk_queue_enter
+   1.61%  perf [kernel.kallsyms]  [k] copy_user_generic_string
+   1.40%  fio  [kernel.kallsyms]  [k] __blk_mq_run_hw_queue
+   1.26%  fio  [kernel.kallsyms]  [k] part_round_stats
+   1.14%  fio  [kernel.kallsyms]  [k] _raw_spin_lock_irq
+   0.96%  fio  [kernel.kallsyms]  [k] __bt_get
+   0.73%  fio  [kernel.kallsyms]  [k] enqueue_task_fair
+   0.71%  fio  [kernel.kallsyms]  [k] enqueue_entity
+   0.69%  fio  [dm_mod]           [k] dm_start_request
+   0.60%  ksoftirqd/6   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   0.59%  ksoftirqd/10  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   0.59%  fio  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
+   0.58%  ksoftirqd/19  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   0.58%  ksoftirqd/18  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   0.58%  ksoftirqd/23  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
--
Re: [dm-devel] [RFC PATCH] dm: fix excessive dm-mq context switching
Hi Mike,

So I gave your patches a go (dm-4.6) but I still don't see the improvement you reported (though I do see a minor improvement).

null_blk queue_mode=2 submit_queues=24
dm_mod blk_mq_nr_hw_queues=24 blk_mq_queue_depth=4096 use_blk_mq=Y

I see 620K IOPs on dm_mq vs. 1750K IOPs on raw nullb0.

blk_mq_nr_hw_queues=24 isn't likely to help you (but with these patches, the first being the most important, it shouldn't hurt either, provided you have 24 cpus).

I tried with less but, as you said, it didn't have an impact...

Could be you have multiple NUMA nodes and are seeing problems from that?

I am running on a dual-socket server; this can most likely be the culprit...

I have 12 cpus (in the same physical cpu) and only a single NUMA node. I get the same results as blk_mq_nr_hw_queues=12 with blk_mq_nr_hw_queues=4 (same goes for null_blk submit_queues). I've seen my IOPs go from ~950K to ~1400K. The peak null_blk can get on my setup is ~1950K.

So I'm still seeing a ~25% drop with dm-mq (but that is much better than the over 50% drop I was seeing).

That's what I was planning on :(

Is there something I'm missing?

Not sure, I just emailed out all my patches (and cc'd you). Please verify you're using the latest here (same as the 'dm-4.6' branch):
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next

I rebased a couple times... so please diff what you have tested against this latest 'dm-4.6' branch.

I am. I'll try to instrument what's going on...