Re: [PATCH for-6.2/block V3 1/2] block: Data type conversion for IO accounting

2022-12-25 Thread Sagi Grimberg




On 12/21/22 06:05, Gulam Mohamed wrote:

Change the data type of the start and end time IO accounting variables in
the block layer from "unsigned long" to "u64". This enables nanosecond
granularity, in the next commit, for devices whose latency is below a
millisecond.

Changes from V2 to V3
=====================
1. Changed all the required variables data-type to u64 as part of this
first patch
2. Create a new setting '2' for iostats in sysfs in next patch
3. Change the code to get the ktime values when iostat=2 in next patch

Signed-off-by: Gulam Mohamed 
---
  block/blk-core.c  | 24 
  block/blk.h   |  2 +-
  drivers/block/drbd/drbd_int.h |  2 +-
  drivers/block/zram/zram_drv.c |  4 ++--
  drivers/md/bcache/request.c   | 10 +-
  drivers/md/dm-core.h  |  2 +-
  drivers/md/dm.c   |  2 +-
  drivers/md/md.h   |  2 +-
  drivers/md/raid1.h|  2 +-
  drivers/md/raid10.h   |  2 +-
  drivers/md/raid5.c|  2 +-
  drivers/nvdimm/btt.c  |  2 +-
  drivers/nvdimm/pmem.c |  2 +-
  include/linux/blk_types.h |  2 +-
  include/linux/blkdev.h| 12 ++--
  include/linux/part_stat.h |  2 +-


nvme-mpath now also has stats, so struct nvme_request should also be
updated.



Re: [dm-devel] [PATCH 09/11] nvme: remove a spurious clear of discard_alignment

2022-04-26 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll

2021-03-24 Thread Sagi Grimberg




Well, when it will failover, it will probably be directed to the poll
queues. Maybe I'm missing something...


In this patchset, because it isn't submitted directly from the FS, there
isn't a polling context associated with this bio, so its HIPRI flag
will be cleared and it falls back to IRQ mode.


I think that's fine for failover I/O...




Re: [dm-devel] [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll

2021-03-23 Thread Sagi Grimberg




+static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
+{
+   bio->bi_iter.bi_private_data = cookie;
+}
+


Hey Ming, thinking about nvme-mpath, I'm thinking that this should be
an exported function for failover. nvme-mpath updates bio.bi_dev
when re-submitting I/Os to an alternate path, so I'm thinking
that if this function is exported then nvme-mpath could do as little
as the below to allow polling?

--
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 92adebfaf86f..e562e296153b 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -345,6 +345,7 @@ static void nvme_requeue_work(struct work_struct *work)
 struct nvme_ns_head *head =
 container_of(work, struct nvme_ns_head, requeue_work);
 struct bio *bio, *next;
+   blk_qc_t cookie;

 	spin_lock_irq(&head->requeue_lock);
 	next = bio_list_get(&head->requeue_list);
@@ -359,7 +360,8 @@ static void nvme_requeue_work(struct work_struct *work)
  * path.
  */
 bio_set_dev(bio, head->disk->part0);
-   submit_bio_noacct(bio);
+   cookie = submit_bio_noacct(bio);
+   blk_bio_poll_post_submit(bio, cookie);
 }
  }
--

I/O failover will create misalignment from the polling context cpu and
the submission cpu (running requeue_work), but I don't see if there is
something that would break here...


I understand requeue shouldn't be a usual event, and I guess it is just
fine to fall back to IRQ-based mode?


Well, when it will failover, it will probably be directed to the poll
queues. Maybe I'm missing something...


This patchset actually doesn't cover such bio submission from kernel context.


What is the difference?




Re: [dm-devel] [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll

2021-03-22 Thread Sagi Grimberg




+static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
+{
+   bio->bi_iter.bi_private_data = cookie;
+}
+


Hey Ming, thinking about nvme-mpath, I'm thinking that this should be
an exported function for failover. nvme-mpath updates bio.bi_dev
when re-submitting I/Os to an alternate path, so I'm thinking
that if this function is exported then nvme-mpath could do as little
as the below to allow polling?

--
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 92adebfaf86f..e562e296153b 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -345,6 +345,7 @@ static void nvme_requeue_work(struct work_struct *work)
struct nvme_ns_head *head =
container_of(work, struct nvme_ns_head, requeue_work);
struct bio *bio, *next;
+   blk_qc_t cookie;

	spin_lock_irq(&head->requeue_lock);
	next = bio_list_get(&head->requeue_list);
@@ -359,7 +360,8 @@ static void nvme_requeue_work(struct work_struct *work)
 * path.
 */
bio_set_dev(bio, head->disk->part0);
-   submit_bio_noacct(bio);
+   cookie = submit_bio_noacct(bio);
+   blk_bio_poll_post_submit(bio, cookie);
}
 }
--

I/O failover will create misalignment from the polling context cpu and
the submission cpu (running requeue_work), but I don't see if there is
something that would break here...

Thoughts?




Re: [dm-devel] [PATCH 03/24] nvme: let set_capacity_revalidate_and_notify update the bdev size

2020-11-09 Thread Sagi Grimberg



[ .. ]

Originally nvme multipath would update/change the size of the multipath
device according to the underlying path devices.
With this patch the size of the multipath device will _not_ change if there
is a change on the underlying devices.


Yes, it will.  Take a close look at nvme_update_disk_info and how it is
called.

Okay, then: What would be the correct way of handling a size update for 
NVMe multipath?

Assuming we're getting an AEN for each path signalling the size change
(or a controller reset leading to a size change).
So if we're updating the size of the multipath device together with the 
path device at the first AEN/reset we'll end up with the other paths 
having a different size than the multipath device (and the path we've 
just been updating).

- Do we care, or cross fingers and hope for the best?
- Shouldn't we detect the case where we won't get a size update for the 
other paths, or, indeed, we have a genuine device size mismatch due to a 
misconfiguration on the target?


IE shouldn't we have a flag 'size update pending' for the other paths,, 
to take them out ouf use temporarily until the other AENs/resets have 
been processed?


The mpath device will take the minimum size from all the paths; that is
what blk_stack_limits does. When the AENs for all the paths arrive,
the mpath size will update.

Not sure how this is different than what we have today...


Re: [dm-devel] [PATCH 3/3] nvme: don't call revalidate_disk from nvme_set_queue_dying

2020-08-24 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 




Re: [dm-devel] [PATCH 1/3] block: replace bd_set_size with bd_set_nr_sectors

2020-08-24 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 




Re: [dm-devel] [PATCH 2/3] block: fix locking for struct block_device size updates

2020-08-24 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 




Re: [dm-devel] [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate

2020-08-14 Thread Sagi Grimberg




+   switch (nvme_req_disposition(req)) {
+   case COMPLETE:
+   nvme_complete_req(req);


nvme_complete_rq calling nvme_complete_req... Maybe call it
__nvme_complete_rq instead?


That's what I had first, but it felt so strangely out of place next
to the other nvme_*_req calls..

Maybe nvme_complete_rq needs to be renamed - what about nvme_req_done?


I'd suggest to call nvme_complete_rq nvme_end_rq because it really calls
blk_mq_end_request. That would confuse with nvme_end_request which
calls blk_mq_complete_request... Maybe we need some naming improvements
throughout.


Maybe call nvme_req_disposition again locally here to not carry
the is_ana_status. But not a biggy..


That means it would have to become non-static in scope, limiting
inlining possibilities, etc.


I see.




Re: [dm-devel] [RESEND PATCH] nvme: explicitly use normal NVMe error handling when appropriate

2020-08-14 Thread Sagi Grimberg




+static inline enum nvme_disposition nvme_req_disposition(struct request *req)
+{
+   if (likely(nvme_req(req)->status == 0))
+   return COMPLETE;
+
+   if (blk_noretry_request(req) ||
+   (nvme_req(req)->status & NVME_SC_DNR) ||
+   nvme_req(req)->retries >= nvme_max_retries)
+   return COMPLETE;
+
+   if (req->cmd_flags & REQ_NVME_MPATH) {
+   switch (nvme_req(req)->status & 0x7ff) {
+   case NVME_SC_ANA_TRANSITION:
+   case NVME_SC_ANA_INACCESSIBLE:
+   case NVME_SC_ANA_PERSISTENT_LOSS:
+   return REDIRECT_ANA;
+   case NVME_SC_HOST_PATH_ERROR:
+   case NVME_SC_HOST_ABORTED_CMD:
+   return REDIRECT_TMP;
+   }
+   }
+
+   if (blk_queue_dying(req->q))
+   return COMPLETE;
+   return RETRY;
+}
+
+static inline void nvme_complete_req(struct request *req)
  {
blk_status_t status = nvme_error_status(nvme_req(req)->status);
  
-	trace_nvme_complete_rq(req);

+   if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
+   req_op(req) == REQ_OP_ZONE_APPEND)
+   req->__sector = nvme_lba_to_sect(req->q->queuedata,
+   le64_to_cpu(nvme_req(req)->result.u64));
+
+   nvme_trace_bio_complete(req, status);
+   blk_mq_end_request(req, status);
+}
  
+void nvme_complete_rq(struct request *req)

+{
+   trace_nvme_complete_rq(req);
nvme_cleanup_cmd(req);
  
  	if (nvme_req(req)->ctrl->kas)

nvme_req(req)->ctrl->comp_seen = true;
  
-	if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {

-   if ((req->cmd_flags & REQ_NVME_MPATH) && nvme_failover_req(req))
-   return;
-
-   if (!blk_queue_dying(req->q)) {
-   nvme_retry_req(req);
-   return;
-   }
-   } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
-  req_op(req) == REQ_OP_ZONE_APPEND) {
-   req->__sector = nvme_lba_to_sect(req->q->queuedata,
-   le64_to_cpu(nvme_req(req)->result.u64));
+   switch (nvme_req_disposition(req)) {
+   case COMPLETE:
+   nvme_complete_req(req);


nvme_complete_rq calling nvme_complete_req... Maybe call it
__nvme_complete_rq instead?


+   return;
+   case RETRY:
+   nvme_retry_req(req);
+   return;
+   case REDIRECT_ANA:
+   nvme_failover_req(req, true);
+   return;
+   case REDIRECT_TMP:
+   nvme_failover_req(req, false);
+   return;
}
-
-   nvme_trace_bio_complete(req, status);
-   blk_mq_end_request(req, status);
  }
  EXPORT_SYMBOL_GPL(nvme_complete_rq);
  
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c

index 3ded54d2c9c6ad..0c22b2c88687a2 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -65,51 +65,32 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
}
  }
  
-bool nvme_failover_req(struct request *req)

+void nvme_failover_req(struct request *req, bool is_ana_status)
  {
struct nvme_ns *ns = req->q->queuedata;
-   u16 status = nvme_req(req)->status;
unsigned long flags;
  
-	switch (status & 0x7ff) {

-   case NVME_SC_ANA_TRANSITION:
-   case NVME_SC_ANA_INACCESSIBLE:
-   case NVME_SC_ANA_PERSISTENT_LOSS:
-   /*
-* If we got back an ANA error we know the controller is alive,
-* but not ready to serve this namespaces.  The spec suggests
-* we should update our general state here, but due to the fact
-* that the admin and I/O queues are not serialized that is
-* fundamentally racy.  So instead just clear the current path,
-* mark the the path as pending and kick of a re-read of the ANA
-* log page ASAP.
-*/
-   nvme_mpath_clear_current_path(ns);
-   if (ns->ctrl->ana_log_buf) {
-   set_bit(NVME_NS_ANA_PENDING, &ns->flags);
-   queue_work(nvme_wq, &ns->ctrl->ana_work);
-   }
-   break;
-   case NVME_SC_HOST_PATH_ERROR:
-   case NVME_SC_HOST_ABORTED_CMD:
-   /*
-* Temporary transport disruption in talking to the controller.
-* Try to send on a new path.
-*/
-   nvme_mpath_clear_current_path(ns);
-   break;
-   default:
-   /* This was a non-ANA error so follow the normal error path. */
-   return false;
+   nvme_mpath_clear_current_path(ns);
+
+   /*
+* If we got back an ANA error we know the controller is alive, but not
+* ready to serve this namespaces.  The spec suggests we should update
+* our general 

Re: [dm-devel] nvme: restore use of blk_path_error() in nvme_complete_rq()

2020-08-07 Thread Sagi Grimberg




Hey Mike,


The point is: blk_path_error() has nothing to do with NVMe errors.
This is dm-multipath logic stuck in the middle of the NVMe error
handling code.


No, it is a means to have multiple subsystems (to this point both SCSI
and NVMe) doing the correct thing when translating subsystem specific
error codes to BLK_STS codes.


Not exactly, I don't find any use of this in scsi. The purpose is to make
sure that nvme and dm speak the same language.


SCSI doesn't need to do additional work to train a multipath layer
because dm-multipath _is_ SCSI's multipathing in Linux.


Agree.


So ensuring SCSI properly classifies its error codes happens as a
side-effect of ensuring continued multipath functionality.

Hannes introduced all these differentiated error codes in block core
because of SCSI.  NVMe is meant to build on the infrastructure that was
established.


Yes, exactly my point. blk_path_error is designed to make nvme and 
dm-multipath speak the same language.



If you, or others you name drop below, understood the point we wouldn't
be having this conversation.  You'd accept the point of blk_path_error()
to be valid and required codification of what constitutes a retryable
path error for the Linux block layer.


This incident is a case where the specific nvme status was designed
to retry on the same path respecting the controller retry delay.
And because nvme used blk_path_error at the time it caused us to use a
non-retryable status to get around that. Granted, no one had
dm-multipath in mind.

So in a sense, there is consensus on changing patch 35038bffa87da
_because_ nvme no longer uses blk_path_error. Otherwise it would break.


"break" meaning it would do failover instead of the more optimal local
retry to the same controller.

I see.  Wish the header for commit 35038bffa87da touched on this even a
little bit ;)


I think it did, but maybe didn't put too much emphasis on it.


But AFAICT the patch I provided doesn't compromise proper local retry --
as long as we first fix nvme_error_status() to return a retryable
BLK_STS for NVME_SC_CMD_INTERRUPTED -- which I assumed as a prereq.

Think of blk_path_error() as a more coarse-grained "is this retryable or
a hard failure?" check.  So for NVME_SC_CMD_INTERRUPTED,
nvme_error_status() _should_ respond with something retryable (I'd
prefer BLK_STS_RESOURCE to be honest).


But blk_path_error semantically means "is this a pathing error?", or at
least that's what its name suggests.


And then nvme_failover_req() is finer-grained; by returning false it now
allows short-circuiting failover and reverting back to NVMe's normal
controller based error recovery -- which it does for
NVME_SC_CMD_INTERRUPTED due to "default" case in nvme_failover_req().

And then the previous nvme_error_status() classification of retryable
BLK_STS obviously never gets returned up the IO stack; it gets thrown
away.


I see what you are saying. The issue is that the code becomes
convoluted (it's a pathing error, oh wait, no it's not a pathing error).


Any BLK_STS mapping of NVMe specific error codes would need to not screw
up by categorizing a retryable error as non-retryable (and vice-versa).


But it is a special type of retryable. There is nothing that fits the
semantics of the current behavior.


I agree.  But that's fine actually.

And this issue is the poster-child for why properly supporting a duality
of driver-level vs upper level multipathing capabilities is pretty much
impossible unless a clean design factors out the different error classes
-- and local error retry is handled before punting to higher level
failover retry.  Think if NVMe were to adopt a bit more disciplined
"local then failover" error handling it all gets easier.


I don't think punting before is easier, because we do a local retry if:
- no multipathing sw on top
- the request needs retry (e.g. no DNR, noretry is off, etc.)
- the nvme error is not pathing related (nvme_failover_req returned false)


This local retry _is_ NVMe specific.  NVMe should just own retrying on
the same controller no matter what (I'll hope that such retry has
awareness to not retry indefinitely though!).


it will retry until the retry limit.


 And this has nothing to
do with multipathing, so the logic to handle it shouldn't be trapped in
nvme_failover_req().


Well given that nvme_failover_req already may not actually failover this
makes some sense to me (although I did have some resistance to make it
that way in the first place, but was convinced otherwise).


I think NVMe can easily fix this by having an earlier stage of checking,
e.g. nvme_local_retry_req(), that shortcircuits ever getting to
higher-level multipathing consideration (be it native NVMe or DM
multipathing) for cases like NVME_SC_CMD_INTERRUPTED.
To be clear: the "default" case of nvme_failover_req() that returns
false to fallback to NVMe's "local" normal NVMe error handling -- that
can stay.. but a more explicit handling of cases like
NVME_SC_CMD_INTERRUPTED should be added to 

Re: [dm-devel] nvme: restore use of blk_path_error() in nvme_complete_rq()

2020-08-06 Thread Sagi Grimberg

Hey Mike,


The point is: blk_path_error() has nothing to do with NVMe errors.
This is dm-multipath logic stuck in the middle of the NVMe error
handling code.


No, it is a means to have multiple subsystems (to this point both SCSI
and NVMe) doing the correct thing when translating subsystem specific
error codes to BLK_STS codes.


Not exactly, I don't find any use of this in scsi. The purpose is to make
sure that nvme and dm speak the same language.


If you, or others you name drop below, understood the point we wouldn't
be having this conversation.  You'd accept the point of blk_path_error()
to be valid and required codification of what constitutes a retryable
path error for the Linux block layer.


This incident is a case where the specific nvme status was designed
to retry on the same path respecting the controller retry delay.
And because nvme used blk_path_error at the time it caused us to use a
non-retryable status to get around that. Granted, no one had 
dm-multipath in mind.


So in a sense, there is consensus on changing patch 35038bffa87da
_because_ nvme no longer uses blk_path_error. Otherwise it would break.


Any BLK_STS mapping of NVMe specific error codes would need to not screw
up by categorizing a retryable error as non-retryable (and vice-versa).


But it is a special type of retryable. There is nothing that fits the
semantics of the current behavior.


Again, assuming proper testing, commit 35038bffa87 wouldn't have made it
upstream if NVMe still used blk_path_error().


Agree.


Yes, your commit 764e9332098c0 needlessly removed NVMe's use of
blk_path_error().  If you're saying it wasn't needless please explain
why.


 Fixes: 764e9332098c0 ("nvme-multipath: do not reset on unknown status")
 Cc: sta...@vger.kerneel.org
 Signed-off-by: Mike Snitzer 
 ---
 drivers/nvme/host/core.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

 diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
 index 6585d57112ad..072f629da4d8 100644
 --- a/drivers/nvme/host/core.c
 +++ b/drivers/nvme/host/core.c
 @@ -290,8 +290,13 @@ void nvme_complete_rq(struct request *req)
 nvme_req(req)->ctrl->comp_seen = true;

 if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
 -   if ((req->cmd_flags & REQ_NVME_MPATH) && 
nvme_failover_req(req))
 -   return;
 +   if (blk_path_error(status)) {
 +   if (req->cmd_flags & REQ_NVME_MPATH) {
 +   if (nvme_failover_req(req))
 +   return;
 +   /* fallthru to normal error handling */
 +   }
 +   }

This would basically undo the patch Hannes, Christoph, and I put together in
commit 764e9332098c0.  This patch greatly simplified and improved the
whole nvme_complete_rq() code path, and I don't support undoing that change.


Please elaborate on how I've undone anything?


I think you're right, this hasn't undone the patch from John, but it
breaks NVME_SC_CMD_INTERRUPTED error handling behavior.


The only thing I may have done is forced NVMe to take more care about
properly translating its NVME_SC to BLK_STS in nvme_error_status().
Which is a good thing.


I don't think there is an issue here of mistakenly converting an nvme
status code to a wrong block status code. This conversion is there
because there is no block status that give us the semantics we need
which is apparently specific to nvme.

I personally don't mind restoring blk_path_error to nvme, I don't
particularly feel strong about it either. But for sure blk_path_error
needs to first provide the semantics needed for NVME_SC_CMD_INTERRUPTED.

...


Anyway, no new BLK_STS is needed at this point.  More discipline with
how NVMe's error handling is changed is.


Please read the above.


If NVMe wants to ensure its
interface isn't broken regularly it _should_ use blk_path_error() to
validate future nvme_error_status() changes.  Miscategorizing NVMe
errors to upper layers is a bug -- not open for debate.


Again, don't agree is a Miscategorization nor a bug, its just something
that is NVMe specific.




Re: [dm-devel] [PATCH V14 00/18] block: support multi-page bvec

2019-01-21 Thread Sagi Grimberg




V14:
- drop patch(patch 4 in V13) for renaming bvec helpers, as suggested by 
Jens
- use mp_bvec_* as multi-page bvec helper name
- fix one build issue, which is caused by missing one conversion of
bio_for_each_segment_all in fs/gfs2
- fix one 32bit ARCH specific issue caused by segment boundary mask
overflow


Hey Ming,

So is nvme-tcp also affected here? The only point where I see nvme-tcp
can be affected is when initializing a bvec iter using bio_segments() as
everywhere else we use iters which should transparently work..

I see that loop was converted, does it mean that nvme-tcp needs to
call something like?
--
bio_for_each_mp_bvec(bv, bio, iter)
nr_bvecs++;
--



Re: [dm-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-20 Thread Sagi Grimberg




Wait, I see that the bvec is still a single array per bio. When you said
a table I thought you meant a 2-dimensional array...


I mean a new 1-d table A has to be created for multiple bios in one rq,
and build it in the following way

rq_for_each_bvec(tmp, rq, rq_iter)
 *A = tmp;

Then you can pass A to iov_iter_bvec() & send().

Given it is over TCP, I guess it should be doable for you to preallocate one
256-bvec table in one page for each request, then sets the max segment size as
(unsigned int)-1, and max segment number as 256, the preallocated table
should work anytime.


A 256-bvec table is really a lot to preallocate, especially when it's not
needed; I can easily initialize the bvec_iter on the bio bvec. If this
involves preallocation of the worst case then I don't consider this to
be an improvement.



Re: [dm-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-20 Thread Sagi Grimberg




Yeah, that is the most common example, given merge is enabled
in most of cases. If the driver or device doesn't care merge,
you can disable it and always get single bio request, then the
bio's bvec table can be reused for send().


Does bvec_iter span bvecs with your patches? I didn't see that change?


Wait, I see that the bvec is still a single array per bio. When you said
a table I thought you meant a 2-dimensional array...

Unless I'm mistaken, I think that the change is pretty simple then.
However, nvme-tcp still needs to be bio aware unless we have some
abstraction in place.. Which will mean that nvme-tcp will need to
open-code bio_bvecs.



Re: [dm-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-20 Thread Sagi Grimberg




I would like to avoid growing bvec tables and keep everything
preallocated. Plus, a bvec_iter operates on a bvec which means
we'll need a table there as well... Not liking it so far...


In case of bios in one request, we can't know how many bvecs there
are except for calling rq_bvecs(), so it may not be suitable to
preallocate the table. If you have to send the IO request in one send(),
runtime allocation may be inevitable.


I don't want to do that, I want to work on a single bvec at a time like
the current implementation does.


If you don't require to send the IO request in one send(), you may send
one bio in one time, and just uses the bio's bvec table directly,
such as the single bio case in lo_rw_aio().


we'd need some indication that we need to reinit my iter with the
new bvec, today we do:

static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
int len)
{
req->snd.data_sent += len;
req->pdu_sent += len;
iov_iter_advance(&req->snd.iter, len);
if (!iov_iter_count(&req->snd.iter) &&
req->snd.data_sent < req->data_len) {
req->snd.curr_bio = req->snd.curr_bio->bi_next;
nvme_tcp_init_send_iter(req);
}
}

and initialize the send iter. I imagine that now I will need to
switch to the next bvec and only if I'm on the last I need to
use the next bio...

Do you offer an API for that?



can this way avoid your blocking issue? You may see this
example in branch 'rq->bio != rq->biotail' of lo_rw_aio().


This is exactly an example of not ignoring the bios...


Yeah, that is the most common example, given merge is enabled
in most of cases. If the driver or device doesn't care merge,
you can disable it and always get single bio request, then the
bio's bvec table can be reused for send().


Does bvec_iter span bvecs with your patches? I didn't see that change?


I'm not sure how this helps me either. Unless we can set a bvec_iter to
span bvecs or have an abstract bio crossing when we re-initialize the
bvec_iter I don't see how I can ignore bios completely...


rq_for_each_bvec() will iterate over all bvecs from all bios, so you
needn't see any bio in this req.


But I don't need this iteration, I need a transparent API like;
bvec2 = rq_bvec_next(rq, bvec)

This way I can simply always reinit my iter without thinking about how
the request/bios/bvecs are constructed...


rq_bvecs() will return how many bvecs there are in this request(cover
all bios in this req)


Still not very useful given that I don't want to use a table...


So looks nvme-tcp host driver might be the 2nd driver which benefits
from multi-page bvec directly.

The multi-page bvec V11 has passed my tests and addressed almost
all the comments during review on V10. I removed bio_vecs() in V11,
but it won't be big deal, we can introduce them anytime when there
is the requirement.


multipage-bvecs and nvme-tcp are going to conflict, so it would be good
to coordinate on this. I think that nvme-tcp host needs some adjustments
as setting a bvec_iter. I'm under the impression that the change is rather
small and self-contained, but I'm not sure I have the full
picture here.


I guess I may not get your exact requirement on block io iterator from nvme-tcp
too, :-(


They are pretty much listed above. Today nvme-tcp sets an iterator with:

vec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
nsegs = bio_segments(bio);
size = bio->bi_iter.bi_size;
offset = bio->bi_iter.bi_bvec_done;
iov_iter_bvec(&req->snd.iter, WRITE, vec, nsegs, size);

and when done, iterate to the next bio and do the same.

With multipage bvec it would be great if we can simply have
something like rq_bvec_next() that would pretty much satisfy
the requirements from the nvme-tcp side...



Re: [dm-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-20 Thread Sagi Grimberg




Not sure I understand the 'blocking' problem in this case.

We can build a bvec table from this req, and send them all
in send(),


I would like to avoid growing bvec tables and keep everything
preallocated. Plus, a bvec_iter operates on a bvec which means
we'll need a table there as well... Not liking it so far...


can this way avoid your blocking issue? You may see this
example in branch 'rq->bio != rq->biotail' of lo_rw_aio().


This is exactly an example of not ignoring the bios...


If this way is what you need, I think you are right, even we may
introduce the following helpers:

rq_for_each_bvec()
rq_bvecs()


I'm not sure how this helps me either. Unless we can set a bvec_iter to
span bvecs or have an abstract bio crossing when we re-initialize the
bvec_iter I don't see how I can ignore bios completely...


So looks nvme-tcp host driver might be the 2nd driver which benefits
from multi-page bvec directly.

The multi-page bvec V11 has passed my tests and addressed almost
all the comments during review on V10. I removed bio_vecs() in V11,
but it won't be big deal, we can introduce them anytime when there
is the requirement.


multipage-bvecs and nvme-tcp are going to conflict, so it would be good
to coordinate on this. I think that nvme-tcp host needs some adjustments
as setting a bvec_iter. I'm under the impression that the change is
rather small and self-contained, but I'm not sure I have the full
picture here.



Re: [dm-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-20 Thread Sagi Grimberg




The only user in your final tree seems to be the loop driver, and
even that one only uses the helper for read/write bios.

I think something like this would be much simpler in the end:


The recently submitted nvme-tcp host driver should also be a user
of this. Does it make sense to keep it as a helper then?


I did take a brief look at the code, and I really don't understand
why the heck it even deals with bios to start with.  Like all the
other nvme transports it is a blk-mq driver and should iterate
over segments in a request and more or less ignore bios.  Something
is horribly wrong in the design.


Can you explain a little more? I'm more than happy to change that but
I'm not completely clear how...

Before we begin a data transfer, we need to set our own iterator that
will advance with the progression of the data transfer. We also need to
keep in mind that all the data transfer (both send and recv) are
completely non blocking (and zero-copy when we send).

That means that every data movement needs to be able to suspend
and resume asynchronously. i.e. we cannot use the following pattern:
rq_for_each_segment(bvec, rq, rq_iter) {
iov_iter_bvec(&iov_iter, WRITE, &bvec, 1, bvec.bv_len);
send(sock, iov_iter);
}

Given that a request can hold more than a single bio, I'm not clear on
how we can achieve that without iterating over the bios in the request
ourselves.

Any useful insight?



Re: [dm-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-19 Thread Sagi Grimberg




The only user in your final tree seems to be the loop driver, and
even that one only uses the helper for read/write bios.

I think something like this would be much simpler in the end:


The recently submitted nvme-tcp host driver should also be a user
of this. Does it make sense to keep it as a helper then?



Re: [dm-devel] [for-4.16 PATCH 4/5] dm mpath: use NVMe error handling to know when an error is retryable

2017-12-20 Thread Sagi Grimberg



But interestingly, with my "mptest" link failure test
(test_01_nvme_offline) I'm not actually seeing NVMe trigger a failure
that needs a multipath layer (be it NVMe multipath or DM multipath) to
fail a path and retry the IO.  The pattern is that the link goes down,
and nvme waits for it to come back (internalizing any failure) and then
the IO continues.. so no multipath _really_ needed:

[55284.011286] nvme nvme0: NVME-FC{0}: controller connectivity lost. Awaiting Reconnect
[55284.020078] nvme nvme1: NVME-FC{1}: controller connectivity lost. Awaiting Reconnect
[55284.028872] nvme nvme2: NVME-FC{2}: controller connectivity lost. Awaiting Reconnect
[55284.037658] nvme nvme3: NVME-FC{3}: controller connectivity lost. Awaiting Reconnect
[55295.157773] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[55295.157775] nvmet: ctrl 4 keep-alive timer (15 seconds) expired!
[55295.157778] nvmet: ctrl 3 keep-alive timer (15 seconds) expired!
[55295.157780] nvmet: ctrl 2 keep-alive timer (15 seconds) expired!
[55295.157781] nvmet: ctrl 4 fatal error occurred!
[55295.157784] nvmet: ctrl 3 fatal error occurred!
[55295.157785] nvmet: ctrl 2 fatal error occurred!
[55295.199816] nvmet: ctrl 1 fatal error occurred!
[55304.047540] nvme nvme0: NVME-FC{0}: connectivity re-established. Attempting reconnect
[55304.056533] nvme nvme1: NVME-FC{1}: connectivity re-established. Attempting reconnect
[55304.066053] nvme nvme2: NVME-FC{2}: connectivity re-established. Attempting reconnect
[55304.075037] nvme nvme3: NVME-FC{3}: connectivity re-established. Attempting reconnect
[55304.373776] nvmet: creating controller 1 for subsystem mptestnqn for NQN nqn.2014-08.org.nvmexpress:uuid:----.
[55304.373835] nvmet: creating controller 2 for subsystem mptestnqn for NQN nqn.2014-08.org.nvmexpress:uuid:----.
[55304.373873] nvmet: creating controller 3 for subsystem mptestnqn for NQN nqn.2014-08.org.nvmexpress:uuid:----.
[55304.373879] nvmet: creating controller 4 for subsystem mptestnqn for NQN nqn.2014-08.org.nvmexpress:uuid:----.
[55304.430988] nvme nvme0: NVME-FC{0}: controller reconnect complete
[55304.433124] nvme nvme3: NVME-FC{3}: controller reconnect complete
[55304.433705] nvme nvme1: NVME-FC{1}: controller reconnect complete
It seems if we have multipath ontop (again: either NVMe native multipath
_or_ DM multipath) we'd prefer to have the equivalent of SCSI's
REQ_FAILFAST_TRANSPORT support?

But nvme_req_needs_retry() calls blk_noretry_request() which returns
true if REQ_FAILFAST_TRANSPORT is set.  Which results in
nvme_req_needs_retry() returning false.  Which causes nvme_complete_rq()
to skip the multipath specific nvme_req_needs_failover(), etc.

So all said:

1) why wait for connection recovery if we have other connections to try?
I think NVMe needs to be plumbed for respecting REQ_FAILFAST_TRANSPORT.


This is specific to FC fail fast logic, nvme-rdma will fail inflight
commands as soon as the transport see an error (or keep alive timeout
expires).

It seems that FC wants to wait for the request retry counter to be
exceeded, but given that the queue isn't unquiesced, the requests remain
quiesced until the host successfully reconnects.



Re: [dm-devel] hch's native NVMe multipathing [was: Re: [PATCH 1/2] Don't blacklist nvme]

2017-02-16 Thread Sagi Grimberg



I'm fine with the path selectors getting moved out; maybe it'll
encourage new path selectors to be developed.

But there will need to be some userspace interface stood up to support
your native NVMe multipathing (you may not think it needed but think in
time there will be a need to configure _something_).  That is the
fragmentation I'm referring to.


I guess one config option that we'd need is multibus vs. failover
which are used per use-case.

I'd have to say that having something much simpler than multipath-tools
does sound appealing.



Re: [dm-devel] NVMeoF multi-path setup

2016-07-13 Thread Sagi Grimberg



On 01/07/16 01:52, Mike Snitzer wrote:

On Thu, Jun 30 2016 at  5:57pm -0400,
Ming Lin  wrote:


On Thu, 2016-06-30 at 14:08 -0700, Ming Lin wrote:

Hi Mike,

I'm trying to test NVMeoF multi-path.

root@host:~# lsmod |grep dm_multipath
dm_multipath   24576  0
root@host:~# ps aux |grep multipath
root 13183  0.0  0.1 238452  4972 ?SLl  13:41   0:00
/sbin/multipathd

I have nvme0 and nvme1 that are 2 paths to the same NVMe subsystem.

root@host:/sys/class/nvme# grep . nvme*/address
nvme0/address:traddr=192.168.3.2,trsvcid=1023
nvme1/address:traddr=192.168.2.2,trsvcid=1023

root@host:/sys/class/nvme# grep . nvme*/subsysnqn
nvme0/subsysnqn:nqn.testiqn
nvme1/subsysnqn:nqn.testiqn

root@host:~# /lib/udev/scsi_id --export --whitelisted -d /dev/nvme1n1
ID_SCSI=1
ID_VENDOR=NVMe
ID_VENDOR_ENC=NVMe\x20\x20\x20\x20
ID_MODEL=Linux
ID_MODEL_ENC=Linux
ID_REVISION=0-rc
ID_TYPE=disk
ID_SERIAL=SNVMe_Linux
ID_SERIAL_SHORT=
ID_SCSI_SERIAL=1122334455667788

root@host:~# /lib/udev/scsi_id --export --whitelisted -d /dev/nvme0n1
ID_SCSI=1
ID_VENDOR=NVMe
ID_VENDOR_ENC=NVMe\x20\x20\x20\x20
ID_MODEL=Linux
ID_MODEL_ENC=Linux
ID_REVISION=0-rc
ID_TYPE=disk
ID_SERIAL=SNVMe_Linux
ID_SERIAL_SHORT=
ID_SCSI_SERIAL=1122334455667788

But seems multipathd didn't recognize these 2 devices.

What else I'm missing?


There are two problems:

1. there is no "/block/" in the path

/sys/devices/virtual/nvme-fabrics/block/nvme0/nvme0n1


You clarified that it is:
/sys/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n1

Do you have CONFIG_BLK_DEV_NVME_SCSI enabled?


Indeed, for dm-multipath we need CONFIG_BLK_DEV_NVME_SCSI on.

Another thing I noticed was that for nvme we need to manually
set the timeout value because nvme devices don't expose
device/timeout sysfs file. This causes dm-multipath to take
a 200 seconds default (not a huge problem because we
have keep alive in fabrics too).



Re: [dm-devel] [RFC PATCH] dm: fix excessive dm-mq context switching

2016-02-08 Thread Sagi Grimberg



The perf report is very similar to the one that started this effort..

I'm afraid we'll need to resolve the per-target m->lock in order
to scale with NUMA...


Could be.  Just for testing, you can try the 2 topmost commits I've put
here (once applied both __multipath_map and multipath_busy won't have
_any_ locking.. again, very much test-only):

http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=devel2


Hi Mike,

So I still don't see the IOPs scale like I expected. With these two
patches applied I see ~670K IOPs while the perf output is different
and does not indicate a clear lock contention.

--
-   4.67%  fio  [kernel.kallsyms]  [k] blk_account_io_start
   - blk_account_io_start
  - 56.05% blk_insert_cloned_request
   map_request
   dm_mq_queue_rq
   __blk_mq_run_hw_queue
   blk_mq_run_hw_queue
   blk_mq_insert_requests
   blk_mq_flush_plug_list
   blk_flush_plug_list
   blk_finish_plug
   do_io_submit
   SyS_io_submit
   entry_SYSCALL_64_fastpath
 + io_submit
  - 43.94% blk_mq_bio_to_request
   blk_mq_make_request
   generic_make_request
   submit_bio
   do_blockdev_direct_IO
   __blockdev_direct_IO
   blkdev_direct_IO
   generic_file_read_iter
   blkdev_read_iter
   aio_run_iocb
   io_submit_one
   do_io_submit
   SyS_io_submit
   entry_SYSCALL_64_fastpath
 + io_submit
-   2.52%  fio  [dm_mod] [k] dm_mq_queue_rq
   - dm_mq_queue_rq
  - 99.16% __blk_mq_run_hw_queue
   blk_mq_run_hw_queue
   blk_mq_insert_requests
   blk_mq_flush_plug_list
   blk_flush_plug_list
   blk_finish_plug
   do_io_submit
   SyS_io_submit
   entry_SYSCALL_64_fastpath
 + io_submit
  + 0.84% blk_mq_run_hw_queue
-   2.46%  fio  [kernel.kallsyms]  [k] blk_mq_hctx_mark_pending
   - blk_mq_hctx_mark_pending
  - 99.79% blk_mq_insert_requests
   blk_mq_flush_plug_list
   blk_flush_plug_list
   blk_finish_plug
   do_io_submit
   SyS_io_submit
   entry_SYSCALL_64_fastpath
 + io_submit
-   2.07%  ksoftirqd/6   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
   - blk_mq_run_hw_queues
  - 99.70% rq_completed
   dm_done
   dm_softirq_done
   blk_done_softirq
 + __do_softirq
+   2.06%  ksoftirqd/0   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   2.02%  ksoftirqd/9   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   2.00%  ksoftirqd/20  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   2.00%  ksoftirqd/12  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.99%  ksoftirqd/11  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.97%  ksoftirqd/18  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.96%  ksoftirqd/1   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.95%  ksoftirqd/14  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.95%  ksoftirqd/13  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.94%  ksoftirqd/5   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.94%  ksoftirqd/8   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.93%  ksoftirqd/2   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.92%  ksoftirqd/21  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.92%  ksoftirqd/17  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.92%  ksoftirqd/7   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.91%  ksoftirqd/23  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.84%  ksoftirqd/4   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.81%  ksoftirqd/19  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.76%  ksoftirqd/3   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.76%  ksoftirqd/16  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.75%  ksoftirqd/15  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.74%  ksoftirqd/22  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.72%  ksoftirqd/10  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   1.38%  perf          [kernel.kallsyms]  [k] copy_user_generic_string

+   1.20%  fio  [kernel.kallsyms][k] enqueue_task_fair
+   1.18%  fio  [kernel.kallsyms][k] 

Re: [dm-devel] [RFC PATCH] dm: fix excessive dm-mq context switching

2016-02-07 Thread Sagi Grimberg



Hello Sagi,


Hey Bart,


Did you run your test on a NUMA system ?


I did.


If so, can you check with e.g.
perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
workload triggers perhaps lock contention ? What you need to look for in
the perf output is whether any functions occupy more than 10% CPU time.


I will, thanks for the tip!



Re: [dm-devel] [RFC PATCH] dm: fix excessive dm-mq context switching

2016-02-07 Thread Sagi Grimberg



If so, can you check with e.g.
perf record -ags -e LLC-load-misses sleep 10 && perf report whether this
workload triggers perhaps lock contention ? What you need to look for in
the perf output is whether any functions occupy more than 10% CPU time.


I will, thanks for the tip!


The perf report is very similar to the one that started this effort..

I'm afraid we'll need to resolve the per-target m->lock in order
to scale with NUMA...

-  17.33%  fio  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - queued_spin_lock_slowpath
  - 52.09% _raw_spin_lock_irq
   __multipath_map
   multipath_clone_and_map
   map_request
   dm_mq_queue_rq
   __blk_mq_run_hw_queue
   blk_mq_run_hw_queue
   blk_mq_insert_requests
   blk_mq_flush_plug_list
   blk_flush_plug_list
   blk_finish_plug
   do_io_submit
   SyS_io_submit
   entry_SYSCALL_64_fastpath
 + io_submit
  - 46.87% _raw_spin_lock_irqsave
 - 99.97% multipath_busy
  dm_mq_queue_rq
  __blk_mq_run_hw_queue
  blk_mq_run_hw_queue
  blk_mq_insert_requests
  blk_mq_flush_plug_list
  blk_flush_plug_list
  blk_finish_plug
  do_io_submit
  SyS_io_submit
  entry_SYSCALL_64_fastpath
+ io_submit
+   4.99%  fio  [kernel.kallsyms]  [k] blk_account_io_start
+   3.93%  fio  [dm_multipath]     [k] __multipath_map
+   2.64%  fio  [dm_multipath]     [k] multipath_busy
+   2.38%  fio  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
+   2.31%  fio  [dm_mod]           [k] dm_mq_queue_rq
+   2.25%  fio  [kernel.kallsyms]  [k] blk_mq_hctx_mark_pending
+   1.81%  fio  [kernel.kallsyms]  [k] blk_queue_enter
+   1.61%  perf [kernel.kallsyms]  [k] copy_user_generic_string
+   1.40%  fio  [kernel.kallsyms]  [k] __blk_mq_run_hw_queue
+   1.26%  fio  [kernel.kallsyms]  [k] part_round_stats
+   1.14%  fio  [kernel.kallsyms]  [k] _raw_spin_lock_irq
+   0.96%  fio  [kernel.kallsyms]  [k] __bt_get
+   0.73%  fio  [kernel.kallsyms]  [k] enqueue_task_fair
+   0.71%  fio  [kernel.kallsyms]  [k] enqueue_entity
+   0.69%  fio  [dm_mod]           [k] dm_start_request
+   0.60%  ksoftirqd/6   [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   0.59%  ksoftirqd/10  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   0.59%  fio  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
+   0.58%  ksoftirqd/19  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   0.58%  ksoftirqd/18  [kernel.kallsyms]  [k] blk_mq_run_hw_queues
+   0.58%  ksoftirqd/23  [kernel.kallsyms]  [k] blk_mq_run_hw_queues




Re: [dm-devel] [RFC PATCH] dm: fix excessive dm-mq context switching

2016-02-07 Thread Sagi Grimberg



Hi Mike,

So I gave your patches a go (dm-4.6) but I still don't see the
improvement you reported (while I do see a minor improvement).

null_blk queue_mode=2 submit_queues=24
dm_mod blk_mq_nr_hw_queues=24 blk_mq_queue_depth=4096 use_blk_mq=Y

I see 620K IOPs on dm_mq vs. 1750K IOPs on raw nullb0.


blk_mq_nr_hw_queues=24 isn't likely to help you (but with these patches,
the first being the most important, it shouldn't hurt either provided
you have 24 cpus).


I tried with less but as you said, it didn't have an impact...


Could be you have multiple NUMA nodes and are seeing problems from that?


I am running on a dual socket server, this can most likely be the
culprit...


I have 12 cpus (in the same physical cpu) and only a single NUMA node.
I get the same results as blk_mq_nr_hw_queues=12 with
blk_mq_nr_hw_queues=4 (same goes for null_blk submit_queues).
I've seen my IOPs go from ~950K to ~1400K.  The peak null_blk can get on
my setup is ~1950K.  So I'm still seeing a ~25% drop with dm-mq (but
that is much better than the over 50% drop I was seeing).


That's what I was planning on :(


Is there something I'm missing?


Not sure, I just emailed out all my patches (and cc'd you).  Please
verify you're using the latest here (same as 'dm-4.6' branch):
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next

I rebased a couple times... so please diff what you have tested against
this latest 'dm-4.6' branch.


I am. I'll try to instrument what's going on...
