Re: [PATCH V2] blk-mq: re-build queue map in case of kdump kernel

2018-12-07 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-04 Thread Sagi Grimberg




Yes, I'm very much in favour of this, too.
We always have this IMO slightly weird notion of stopping the queue, setting
some error flags in the driver, then _restarting_ the queue, just so
that the driver then sees the error flag and terminates the requests.
Which I always found quite counter-intuitive.


What about requests that come in after the iteration runs? How are those
terminated?


If we've reached a dead state, I think you'd want to start a queue freeze
before running the terminating iterator.


It's not necessarily dead; in fabrics we need to handle disconnections
that last for a while before we are able to reconnect (for a variety of
reasons), and we need a way to fail I/O for failover (or requeue, or
block; it's up to the upper layer). It's less of a "last resort" action
than in the PCI case.

Does this guarantee that after freeze+iter we won't get queued with any
other request? If not, then we still need to unfreeze and fail at
queue_rq.
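
For reference, the teardown ordering under discussion looks roughly like this
in the fabrics drivers today (a sketch modeled on nvme-rdma, not Keith's
patch; nvme_fabrics_teardown_sketch is a made-up name):
--
/* Sketch only: nvme-rdma-style error-path teardown. The open question
 * above is whether a freeze (nvme_start_freeze()) before the iteration
 * would let us drop the ready check in ->queue_rq(), or whether requests
 * can still slip in afterwards and must be bounced there.
 */
static void nvme_fabrics_teardown_sketch(struct nvme_rdma_ctrl *ctrl)
{
	nvme_stop_queues(&ctrl->ctrl);		/* quiesce: no new ->queue_rq() */
	/* ...stop/drain the transport queues so completions can't race... */
	blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_cancel_request,
				&ctrl->ctrl);	/* terminate everything in flight */
	nvme_start_queues(&ctrl->ctrl);		/* unquiesce; until we reconnect,
						 * late requests are failed or
						 * requeued from ->queue_rq() */
}
--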


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-04 Thread Sagi Grimberg




Yes, I'm very much in favour of this, too.
We always have this IMO slightly weird notion of stopping the queue, setting
some error flags in the driver, then _restarting_ the queue, just so
that the driver then sees the error flag and terminates the requests.
Which I always found quite counter-intuitive.

What about requests that come in after the iteration runs? How are those
terminated?

If we've reached a dead state, I think you'd want to start a queue freeze
before running the terminating iterator.


For the requests that come in after the iterator, the nvmf_check_ready() 
routine, which validates controller state, will catch and bounce them.


The point of this patch was to omit the check in queue_rq like Keith
did in patch #2.
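
For context, the queue_rq check being referred to looks roughly like this in
the rdma transport (sketch; nvmf_check_ready()/nvmf_fail_nonready_command()
are the existing fabrics helpers, the rest is abbreviated):
--
static blk_status_t nvme_rdma_queue_rq_sketch(struct blk_mq_hw_ctx *hctx,
					      const struct blk_mq_queue_data *bd)
{
	struct nvme_rdma_queue *queue = hctx->driver_data;
	struct request *rq = bd->rq;
	bool queue_ready = test_bit(NVME_RDMA_Q_LIVE, &queue->flags);

	/* This is the state check patch #2 wants to drop: anything that
	 * arrives while the queue isn't live gets failed or requeued here. */
	if (!nvmf_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
		return nvmf_fail_nonready_command(&queue->ctrl->ctrl, rq);

	/* ...normal fast path: set up the command and post it... */
	return BLK_STS_OK;
}
--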


Re: [PATCH 10/13] nvme-mpath: remove I/O polling support

2018-12-04 Thread Sagi Grimberg




If it really becomes an issue we
should rework the nvme code to also skip the multipath code for any
private namespace, even if that could mean some trouble when rescanning.



This requires some explanation: skip the multipath code how?


We currently always go through the multipath node as long as the
controller is multipath capable.  If we care about e.g. polling
on a private namespace on a dual ported U.2 drive we'd have to make
sure we don't go through the multipath device node for private namespaces
that can only have one path, and only use it for shared namespaces.


But we'd still use the multipath node for shared namespaces (and also
polling if needed). I agree that private namespaces can skip the
multipath node.


Re: [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues

2018-12-04 Thread Sagi Grimberg




+static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag)


Do we still need to carry the tag around?


Yes, the timeout handler polls for a specific tag.


Does it have to? The documentation suggests that we missed
an interrupt, so it is probably waiting on the completion queue.

I'd say that it'd be better if the tag search were implemented
in the timeout handler alone so we don't have to pass the tag around
for everyone... Thoughts?
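
One possible shape of that suggestion, as a sketch only: let a tag-less poll
helper reap the whole CQ and have the timeout handler check afterwards whether
its own command was among what got reaped. The completed-state check used here
(blk_mq_request_completed()) is an assumption about what blk-mq would have to
expose, not an existing helper at the time:
--
static enum blk_eh_timer_return nvme_timeout_sketch(struct request *req,
						    bool reserved)
{
	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
	struct nvme_queue *nvmeq = iod->nvmeq;

	/* reap whatever is sitting on the CQ; no tag parameter needed */
	nvme_poll_irqdisable(nvmeq);

	/* was the timed-out command among what we just reaped?
	 * (blk_mq_request_completed() is assumed here; blk-mq would need
	 * to expose such a check, which is part of the trade-off) */
	if (blk_mq_request_completed(req)) {
		dev_warn(nvmeq->dev->ctrl.device,
			 "I/O %d QID %d: completed by polling\n",
			 req->tag, nvmeq->qid);
		return BLK_EH_DONE;
	}

	/* ...otherwise fall through to the usual abort/reset handling... */
	return BLK_EH_RESET_TIMER;
}
--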


Re: [PATCH 02/13] nvme-pci: use atomic bitops to mark a queue enabled

2018-12-04 Thread Sagi Grimberg




@@ -2173,6 +2157,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	if (nr_io_queues == 0)
 		return 0;
+	
+	clear_bit(NVMEQ_ENABLED, &adminq->flags);
 


This is a change of behavior, looks correct though as we can fail
nvme_setup_irqs after we freed the admin vector. Needs documentation
though..


I have a hard time parsing the above, can you point to where the problem
is?


This flag replaces cq_vector = -1 for indicating the queue state, so I'd
expect it not to appear where the former didn't. In this case we clear
the flag in a place where we did not touch cq_vector before, which
seems correct to me nonetheless.
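
To make the pattern concrete, a sketch (not the actual hunks) of how the flag
is meant to be used around vector teardown and re-allocation;
nvme_setup_irqs_sketch is a placeholder for the real vector (re)allocation:
--
static int nvme_setup_io_queues_sketch(struct nvme_dev *dev)
{
	struct nvme_queue *adminq = &dev->queues[0];
	struct pci_dev *pdev = to_pci_dev(dev->dev);
	int result;

	/* vectors (including the admin vector) are about to go away:
	 * mark the admin queue unusable first (the new clear_bit above) */
	clear_bit(NVMEQ_ENABLED, &adminq->flags);
	pci_free_irq_vectors(pdev);

	result = nvme_setup_irqs_sketch(dev);	/* may fail: queue stays off */
	if (result <= 0)
		return -EIO;

	/* only re-enable once the admin vector is safely requested back */
	set_bit(NVMEQ_ENABLED, &adminq->flags);
	return 0;
}
--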


Re: [PATCH 01/13] block: move queues types to the block layer

2018-12-04 Thread Sagi Grimberg




Nit, there seems to be an extra newline that can be omitted here before
the else if statement (if I'm reading this correctly)...


Empty lines can always be omitted, but in this case I actually like it
as it seems to help readability..


If you think it's useful I'm fine with it as is...


Re: [PATCH 7/7] blk-mq: use plug for devices that implement ->commits_rqs()

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 6/7] blk-mq: use bd->last == true for list inserts

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 4/7] virtio_blk: implement mq_ops->commit_rqs() hook

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 5/7] ataflop: implement mq_ops->commit_rqs() hook

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 2/7] blk-mq: add mq_ops->commit_rqs()

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 1/7] block: improve logic around when to sort a plug list

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-03 Thread Sagi Grimberg




A driver may wish to iterate every tagged request, not just ones that
satisfy blk_mq_request_started(). The intended use is so a driver may
terminate entered requests on quiesced queues.


How about we just move the started check into the handler passed in for
those that care about it? Much saner to make the interface iterate
everything, and leave whatever state check to the callback.


So we used to do that, and I changed it back in May to test for
MQ_RQ_IN_FLIGHT, and then Ming changed it to check
blk_mq_request_started.  So this is clearly a minefield of sorts..

Note that at least mtip32xx, nbd, skd and the various nvme transports
want to use the function to terminate all requests in the error
path, and it would be great to have one single understood, documented
and debugged helper for that in the core, so this is a vote for moving
more of the logic in your second helper into the core code.  skd
will actually need to use ->complete to release resources for that, though,
and mtip plays some odd abort bits.  If it weren't for the interesting
abort behavior in nvme-fc that means we could even unexport the
low-level interface.


Yes, I'm very much in favour of this, too.
We always have this IMO slightly weird notion of stopping the queue, setting
some error flags in the driver, then _restarting_ the queue, just so 
that the driver then sees the error flag and terminates the requests.

Which I always found quite counter-intuitive.


What about requests that come in after the iteration runs? How are those
terminated?


Re: [PATCH 13/13] block: enable polling by default if a poll map is initalized

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 12/13] block: only allow polling if a poll queue_map exists

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 11/13] block: remove ->poll_fn

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 10/13] nvme-mpath: remove I/O polling support

2018-12-03 Thread Sagi Grimberg

If it really becomes an issue we
should rework the nvme code to also skip the multipath code for any
private namespace, even if that could mean some trouble when rescanning.



This requires some explanation: skip the multipath code how?

Other than that,
Reviewed-by: Sagi Grimberg 


Re: [PATCH 08/13] nvme-pci: remove the CQ lock for interrupt driven queues

2018-12-03 Thread Sagi Grimberg




Now that we can't poll regular, interrupt driven I/O queues there
is almost nothing that can race with an interrupt.  The only
possible other contexts polling a CQ are the error handler and
queue shutdown, and both are so far off in the slow path that
we can simply use the big hammer of disabling interrupts.

With that we can stop taking the cq_lock for normal queues.


Nice,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 06/13] nvme-pci: refactor nvme_disable_io_queues

2018-12-03 Thread Sagi Grimberg




@@ -2428,7 +2426,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	nvme_stop_queues(&dev->ctrl);
 
 	if (!dead && dev->ctrl.queue_count > 0) {
-		nvme_disable_io_queues(dev);
+		if (nvme_disable_io_queues(dev, nvme_admin_delete_sq))
+


Would be nice if the opcode change would be kept inside but still
split like:

static void nvme_disable_io_queues(struct nvme_dev *dev)
{
	if (__nvme_disable_io_queues(dev, nvme_admin_delete_sq))
		__nvme_disable_io_queues(dev, nvme_admin_delete_cq);
}


Re: [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues

2018-12-03 Thread Sagi Grimberg




+static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag)


Do we still need to carry the tag around?

Other than that,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 04/13] nvme-pci: only allow polling with separate poll queues

2018-12-03 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 03/13] nvme-pci: cleanup SQ allocation a bit

2018-12-03 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 02/13] nvme-pci: use atomic bitops to mark a queue enabled

2018-12-03 Thread Sagi Grimberg




@@ -2173,6 +2157,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 
 	if (nr_io_queues == 0)
 		return 0;
+	
+	clear_bit(NVMEQ_ENABLED, &adminq->flags);
 


This is a change of behavior, looks correct though as we can fail
nvme_setup_irqs after we freed the admin vector. Needs documentation 
though..


Re: [PATCH 01/13] block: move queues types to the block layer

2018-12-03 Thread Sagi Grimberg




On 12/2/18 8:46 AM, Christoph Hellwig wrote:

Having another indirect call in the fast path doesn't really help
in our post-spectre world.  Also having too many queue types is just
going to create confusion, so I'd rather manage them centrally.

Note that the queue type naming and ordering changes a bit - the
first index now is the default queue for everything not explicitly
marked, the optional ones are read and poll queues.

Signed-off-by: Christoph Hellwig 
---
  block/blk-mq-sysfs.c|  9 +-
  block/blk-mq.h  | 21 +++--
  drivers/nvme/host/pci.c | 68 +++--
  include/linux/blk-mq.h  | 15 -
  4 files changed, 51 insertions(+), 62 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 6efef1f679f0..9c2df137256a 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -173,9 +173,16 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
 	return ret;
 }
 
+static const char *const hctx_types[] = {
+	[HCTX_TYPE_DEFAULT]	= "default",
+	[HCTX_TYPE_READ]	= "read",
+	[HCTX_TYPE_POLL]	= "poll",
+};
+
 static ssize_t blk_mq_hw_sysfs_type_show(struct blk_mq_hw_ctx *hctx, char *page)
 {
-	return sprintf(page, "%u\n", hctx->type);
+	BUILD_BUG_ON(ARRAY_SIZE(hctx_types) != HCTX_MAX_TYPES);
+	return sprintf(page, "%s\n", hctx_types[hctx->type]);
 }
 
 static struct attribute *default_ctx_attrs[] = {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 7291e5379358..a664ea44ffd4 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -81,16 +81,14 @@ extern int blk_mq_hw_queue_to_node(struct blk_mq_queue_map *qmap, unsigned int);
 /*
  * blk_mq_map_queue_type() - map (hctx_type,cpu) to hardware queue
  * @q: request queue
- * @hctx_type: the hctx type index
+ * @type: the hctx type index
  * @cpu: CPU
  */
 static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *q,
-							  unsigned int hctx_type,
+							  enum hctx_type type,
 							  unsigned int cpu)
 {
-	struct blk_mq_tag_set *set = q->tag_set;
-
-	return q->queue_hw_ctx[set->map[hctx_type].mq_map[cpu]];
+	return q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]];
 }
 
 /*
@@ -103,12 +101,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 						     unsigned int flags,
 						     unsigned int cpu)
 {
-	int hctx_type = 0;
+	enum hctx_type type = HCTX_TYPE_DEFAULT;
+
+	if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
+	    ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
+		type = HCTX_TYPE_POLL;
 
-	if (q->mq_ops->rq_flags_to_type)
-		hctx_type = q->mq_ops->rq_flags_to_type(q, flags);
+	else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
+		 ((flags & REQ_OP_MASK) == REQ_OP_READ))
+		type = HCTX_TYPE_READ;


Nit, there seems to be an extra newline that can be omitted here before
the else if statement (if I'm reading this correctly)...

Otherwise looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 2/2] nvme: create slaves/holder entries for multipath devices

2018-11-27 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 1/2] block: move holder tracking from struct block_device to hd_struct

2018-11-27 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 2/7] block: have ->poll_fn() return number of entries polled

2018-11-27 Thread Sagi Grimberg




We currently only really support sync poll, ie poll with 1
IO in flight. This prepares us for supporting async poll.


Hey Jens,

So are we sure that it is fine to simply replace the
poll functionality? You say that we support poll
with only 1 I/O inflight, but is that entirely true,
semantically speaking?


It is, unless you have multiple threads all doing polling.  Which is
pretty inefficient, as I'm sure you know.


I am, wasn't referring to multiple threads though..



The interface is selective polling, which means that
the user can have more than a single I/O inflight but wants
to poll for a specific one that it really cares about.

Would this introduce a regression for users that rely
on preadv2 to basically "complete *this* IO as fast as possible"?


No, it'll be exactly the same, and believe me, I've done plenty of
performance testing to ensure that it works well.  In fact, with the
changes queued up and this on top, polling is both faster and more
efficient than it ever has been before, for both the classic
preadv2/pwritev2 and the async polled IO.


I don't argue that this has not been tested, I was referring to the
following use-case: app queues say 16 I/Os with io_submit and then
issues preadv2 for an "important" 512B sector.

With this proposed change, is poll_fn more likely to return with the
first completion it sees rather than continue poll until it sees that
specific I/O it polled for?


I think that would be useless. For SYNC type polling with one thread, it
doesn't matter if we're looking for a particular IO, or just any IO.
Once you're into two threads, you're already wasting huge amounts of CPU,
just to get to QD=2. The poll loop handles finding each other's IOs just
fine, so there's no worry on that side.

Polling was originally designed for the strictly SYNC interfaces, and
since we had the cookie anyway, it was tempting to look for a specific
IO. I think that was a mistake, as it assumed that we'd never do async
polling.


Not arguing. Just that, now that we already have the selective interface in
place, I'm asking whether it is OK to change it (IFF the above question is
indeed real), esoteric as this use-case may be...


Re: [PATCH 2/7] block: have ->poll_fn() return number of entries polled

2018-11-25 Thread Sagi Grimberg

We currently only really support sync poll, ie poll with 1
IO in flight. This prepares us for supporting async poll.


Hey Jens,

So are we sure that it is fine to simply replace the
poll functionality? You say that we support poll
with only 1 I/O inflight, but is that entirely true,
semantically speaking?

The interface is selective polling, which means that
the user can have more than a single I/O inflight but wants
to poll for a specific one that it really cares about.

Would this introduce a regression for users that rely
on preadv2 to basically "complete *this* IO as fast as possible"?

Note that I like the proposed direction, I'm merely questioning
if it is OK to simply change how selective polling worked until
today instead of introducing a separate blk_poll_all(q) (but with
a better name) which would be wired up to aio polling and friends.
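
For readers less familiar with the interface in question: "selective polling"
here means the preadv2/pwritev2 RWF_HIPRI path. A minimal userspace example
(the device path is a placeholder, and polling only kicks in if the queue has
polling enabled):
--
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	struct iovec iov;
	int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);

	if (fd < 0 || posix_memalign(&buf, 4096, 4096))
		return 1;

	iov.iov_base = buf;
	iov.iov_len = 4096;

	/* RWF_HIPRI: "complete *this* IO as fast as possible" (polled) */
	if (preadv2(fd, &iov, 1, 0, RWF_HIPRI) < 0)
		perror("preadv2");

	free(buf);
	close(fd);
	return 0;
}
--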


Re: [PATCH 6/7] block: make blk_poll() take a parameter on whether to spin or not

2018-11-20 Thread Sagi Grimberg




This opportunistic poll is pretty bogus now as we never set the HIPRI
flag and it should probably be removed in a prep patch.  We should then
later try to use a scheme similar to your aio polling for the nvme
target as well.


I'll kill it in a pre-patch.


Agreed..


Re: [GIT PULL] nvme fixes for 4.20

2018-11-09 Thread Sagi Grimberg




That is because your for-linus was still 4.20-rc based when
I created it.


It's never been 4.20-rc based. After the initial merge, it got based
on a random commit in the merge window:

commit 3acbd2de6bc3af215c6ed7732dfc097d1e238503
Merge: d49f8a52b15b de7d83da84bd
Author: Linus Torvalds 
Date:   Thu Oct 25 09:00:15 2018 -0700

 Merge tag 'sound-4.20-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

and has been ever since.


That was because I rebased it when your for-linus didn't include Ming's
buggy version of the brd commit. Sorry for the mixup..


Re: [PATCH v2] block: fix rdma queue mapping

2018-10-30 Thread Sagi Grimberg




Subject: Re: [PATCH v2] block: fix rdma queue mapping



Sagi - From what I can tell, i40iw is also exposed to this same issue
if the IRQ affinity is configured by user.


managed affinity does not allow setting smp_affinity from userspace.


OK. But our device IRQs are not set up to be managed-affinity based and can
be tuned by the user.
And that seems a reasonable approach, since we can never tell ahead of time
what might be an optimal config for a particular workload.


I understand, but it's not necessarily the case; the vast majority of
users don't touch it (or at least don't want to), and managed affinity gives
you the benefits of linear-assignment good practice (among others).

The same argument can be made for a number of PCIe devices that do use
managed affinity, but the choice was to get it right from the start.
Still not sure why (r)NICs are different in that respect.

Anyways, I was just asking, not telling you to go and use it..


Re: [PATCH v2] block: fix rdma queue mapping

2018-10-23 Thread Sagi Grimberg




Sagi - From what I can tell, i40iw is also exposed to this same issue if the 
IRQ affinity
is configured by user.


managed affinity does not allow setting smp_affinity from userspace.


Re: [PATCH v2] block: fix rdma queue mapping

2018-10-23 Thread Sagi Grimberg




Christoph, Sagi:  it seems you think /proc/irq/$IRP/smp_affinity
shouldn't be allowed if drivers support managed affinity. Is that correct?


Not just shouldn't, but simply can't.


But as it stands, things are just plain borked if an rdma driver
supports ib_get_vector_affinity() yet the admin changes the affinity via
/proc...


I think we need to fix ib_get_vector_affinity to not return anything
if the device doesn't use managed irq affinity.


Steve, does iw_cxgb4 use managed affinity?

I'll send a patch for mlx5 to simply not return anything as managed
affinity is not something that the maintainers want to do.


I'm beginning to think I don't know what "managed affinity" actually is.  
Currently iw_cxgb4 doesn't support ib_get_vector_affinity().  I have a patch for it, but 
ran into this whole issue with nvme failing if someone changes the affinity map via /proc.


That means that the PCI subsystem gets your vector(s) affinity right and
immutable. It also guarantees that you have reserved vectors and do not get
a best-effort assignment when cpu cores are offlined.

You can simply enable it by adding PCI_IRQ_AFFINITY to
pci_alloc_irq_vectors() or call pci_alloc_irq_vectors_affinity()
to communicate post/pre vectors that don't participate in
affinitization (nvme uses it for admin queue).

This way you can easily plug ->get_vector_affinity() to return
pci_irq_get_affinity(dev, vector)

The original patch set from hch:
https://lwn.net/Articles/693653/
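
To make the two pieces above concrete, a sketch with placeholder driver names
("mydrv", mydrv_to_pdev): allocate the vectors as managed with
PCI_IRQ_AFFINITY, then have ->get_vector_affinity() report what the PCI layer
decided:
--
static int mydrv_alloc_vectors(struct pci_dev *pdev, int min_vecs, int max_vecs)
{
	struct irq_affinity affd = {
		.pre_vectors = 1,	/* e.g. one non-affinitized ctrl/async vector */
	};

	/* PCI_IRQ_AFFINITY == managed: affinity is fixed, spread over the
	 * online CPUs, and /proc/irq/$N/smp_affinity becomes read-only */
	return pci_alloc_irq_vectors_affinity(pdev, min_vecs, max_vecs,
					      PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					      &affd);
}

static const struct cpumask *
mydrv_get_vector_affinity(struct ib_device *ibdev, int comp_vector)
{
	struct pci_dev *pdev = mydrv_to_pdev(ibdev);	/* placeholder helper */

	/* report the managed affinity so blk-mq-rdma can map queues to it;
	 * the +1 skips the pre_vectors above */
	return pci_irq_get_affinity(pdev, comp_vector + 1);
}
--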


Re: [PATCH v2] block: fix rdma queue mapping

2018-10-23 Thread Sagi Grimberg




Christoph, Sagi:  it seems you think /proc/irq/$IRP/smp_affinity
shouldn't be allowed if drivers support managed affinity. Is that correct?


Not just shouldn't, but simply can't.


But as it stands, things are just plain borked if an rdma driver
supports ib_get_vector_affinity() yet the admin changes the affinity via
/proc...


I think we need to fix ib_get_vector_affinity to not return anything
if the device doesn't use managed irq affinity.


Steve, does iw_cxgb4 use managed affinity?

I'll send a patch for mlx5 to simply not return anything as managed
affinity is not something that the maintainers want to do.


Re: [PATCH v2] block: fix rdma queue mapping

2018-10-15 Thread Sagi Grimberg




Can we please make forward progress on this?

Christoph, Sagi:  it seems you think /proc/irq/$IRP/smp_affinity
shouldn't be allowed if drivers support managed affinity. Is that 
correct?


Perhaps that can be codified and be a way forward?  IE: Somehow allow
the admin to choose either "managed by the driver/ulps" or "managed by
the system admin directly"?

Or just use Sagi's patch.  Perhaps a WARN_ONCE() if the affinity looks
wonked when set via procfs?  Just thinking out loud...


I think we should take my patch. While ideally we'd like to have
a perfect linear irq-based queue mapping, people apparently really like
to modify their devices' irq affinity. With this, at least some of the queues
are assigned according to the device affinity. It's not ideal, but at
least better than blindly falling back to the default mapping.

Christoph?

If there is strong resistance to the patch we should at least start
by falling back to the default mapping unconditionally.


Ping?

Would you prefer we fall back to the naive mapping if the device is not
using managed affinity?


Re: [PATCH v2] block: fix rdma queue mapping

2018-10-03 Thread Sagi Grimberg




Can we please make forward progress on this?

Christoph, Sagi:  it seems you think /proc/irq/$IRP/smp_affinity
shouldn't be allowed if drivers support managed affinity. Is that correct?

Perhaps that can be codified and be a way forward?  IE: Somehow allow
the admin to choose either "managed by the driver/ulps" or "managed by
the system admin directly"?

Or just use Sagi's patch.  Perhaps a WARN_ONCE() if the affinity looks
wonked when set via procfs?  Just thinking out loud...


I think we should take my patch. While ideally we'd like to have
a perfect linear irq-based queue mapping, people apparently really like
to modify their devices' irq affinity. With this, at least some of the queues
are assigned according to the device affinity. It's not ideal, but at
least better than blindly falling back to the default mapping.

Christoph?

If there is strong resistance to the patch we should at least start
by falling back to the default mapping unconditionally.


Re: [PATCH] block: fix rdma queue mapping

2018-08-24 Thread Sagi Grimberg




Hi Sagi,
did you have a chance to look at Israel's and my fixes that we
attached to the first thread?


There are a few issues with this approach. For example, in case you don't
have a "free" cpu in the mask for Qi, you take a cpu from the Qi+j mask.


Also, in case we have a non-symmetrical affinity for e.g. 2 queues and the
system has 32 cpus, the mapping is 16 for each (and not according to the
user affinity...).


Can you explain a little better? From Steve's mappings it behaved like I
expected it to.


Re: [PATCH v2] block: fix rdma queue mapping

2018-08-24 Thread Sagi Grimberg




nvme-rdma attempts to map queues based on irq vector affinity.
However, for some devices, completion vector irq affinity is
configurable by the user which can break the existing assumption
that irq vectors are optimally arranged over the host cpu cores.


IFF affinity is configurable we should never use this code,
as it breaks the model entirely.  ib_get_vector_affinity should
never return a valid mask if affinity is configurable.


I agree that the model intended initially doesn't fit. But it seems
that some users like to write into their nic's
/proc/irq/$IRP/smp_affinity and get mad at us for not letting them when
we use managed affinity.

So instead of falling back to the block mapping function we try
to do a little better first:
1. map according to the device vector affinity
2. map vectors that end up without a mapping to cpus that belong
   to the same numa-node
3. map all the rest of the unmapped cpus like the block layer
   would do.

We could have device drivers that don't use managed affinity to never
return a valid mask but that would never allow affinity based mapping
which is optimal at least for users that do not mangle with device
irq affinity (which is probably the majority of users).

Thoughts?


Re: [PATCH v2] block: fix rdma queue mapping

2018-08-24 Thread Sagi Grimberg




I guess this way still can't fix the request allocation crash issue
triggered by using blk_mq_alloc_request_hctx(), in which one hw queue may
not be mapped from any online CPU.


Not really. I guess we will need to simply skip queues that are
mapped to an offline cpu.


Maybe this patch isn't for this issue, but it is closely related.


Yes, another patch is still needed.

Steve, do you still have that patch? I don't seem to
find it anywhere.


[PATCH v2] block: fix rdma queue mapping

2018-08-20 Thread Sagi Grimberg
nvme-rdma attempts to map queues based on irq vector affinity.
However, for some devices, completion vector irq affinity is
configurable by the user which can break the existing assumption
that irq vectors are optimally arranged over the host cpu cores.

So we map queues in two stages:
First, map queues according to the completion vector IRQ affinity,
taking the first cpu in the vector affinity map.
If the current irq affinity is arranged such that a vector is not
assigned to any distinct cpu, we map it to a cpu that is on the same
node. If numa affinity cannot be satisfied, we map it to any unmapped
cpu we can find. Then, map the remaining cpus in the possible cpumap
naively.

Tested-by: Steve Wise 
Signed-off-by: Sagi Grimberg 
---
Changes from v1:
- fixed double semicolon typo

 block/blk-mq-cpumap.c  | 39 +++-
 block/blk-mq-rdma.c| 80 --
 include/linux/blk-mq.h |  1 +
 3 files changed, 93 insertions(+), 27 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3eb169f15842..34811db8cba9 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -30,30 +30,35 @@ static int get_first_sibling(unsigned int cpu)
return cpu;
 }
 
-int blk_mq_map_queues(struct blk_mq_tag_set *set)
+void blk_mq_map_queue_cpu(struct blk_mq_tag_set *set, unsigned int cpu)
 {
unsigned int *map = set->mq_map;
unsigned int nr_queues = set->nr_hw_queues;
-   unsigned int cpu, first_sibling;
+   unsigned int first_sibling;
 
-   for_each_possible_cpu(cpu) {
-   /*
-* First do sequential mapping between CPUs and queues.
-* In case we still have CPUs to map, and we have some number of
-	 * threads per cores then map sibling threads to the same queue for
-	 * performace optimizations.
-*/
-   if (cpu < nr_queues) {
+   /*
+* First do sequential mapping between CPUs and queues.
+* In case we still have CPUs to map, and we have some number of
+* threads per cores then map sibling threads to the same queue for
+* performace optimizations.
+*/
+   if (cpu < nr_queues) {
+   map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+   } else {
+   first_sibling = get_first_sibling(cpu);
+   if (first_sibling == cpu)
map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-   } else {
-   first_sibling = get_first_sibling(cpu);
-   if (first_sibling == cpu)
-   map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-   else
-   map[cpu] = map[first_sibling];
-   }
+   else
+   map[cpu] = map[first_sibling];
}
+}
+
+int blk_mq_map_queues(struct blk_mq_tag_set *set)
+{
+   unsigned int cpu;
 
+   for_each_possible_cpu(cpu)
+   blk_mq_map_queue_cpu(set, cpu);
return 0;
 }
 EXPORT_SYMBOL_GPL(blk_mq_map_queues);
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..3bce60cd5bcf 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -14,6 +14,61 @@
 #include 
 #include 
 
+static int blk_mq_rdma_map_queue(struct blk_mq_tag_set *set,
+   struct ib_device *dev, int first_vec, unsigned int queue)
+{
+   const struct cpumask *mask;
+   unsigned int cpu;
+   bool mapped = false;
+
+   mask = ib_get_vector_affinity(dev, first_vec + queue);
+   if (!mask)
+   return -ENOTSUPP;
+
+   /* map with an unmapped cpu according to affinity mask */
+   for_each_cpu(cpu, mask) {
+   if (set->mq_map[cpu] == UINT_MAX) {
+   set->mq_map[cpu] = queue;
+   mapped = true;
+   break;
+   }
+   }
+
+   if (!mapped) {
+   int n;
+
+   /* map with an unmapped cpu in the same numa node */
+   for_each_node(n) {
+   const struct cpumask *node_cpumask = cpumask_of_node(n);
+
+   if (!cpumask_intersects(mask, node_cpumask))
+   continue;
+
+   for_each_cpu(cpu, node_cpumask) {
+   if (set->mq_map[cpu] == UINT_MAX) {
+   set->mq_map[cpu] = queue;
+   mapped = true;
+   break;
+   }
+   }
+   }
+   }
+
+   if (!mapped) {
+   /* map with any unmapped cpu we can find */
+   for_each_possible_cpu(cpu) {
+   if (set->mq_map[cpu] == UINT_MAX) {
+

[PATCH] block: fix rdma queue mapping

2018-08-17 Thread Sagi Grimberg
nvme-rdma attempts to map queues based on irq vector affinity.
However, for some devices, completion vector irq affinity is
configurable by the user which can break the existing assumption
that irq vectors are optimally arranged over the host cpu cores.

So we map queues in two stages:
First, map queues according to the completion vector IRQ affinity,
taking the first cpu in the vector affinity map.
If the current irq affinity is arranged such that a vector is not
assigned to any distinct cpu, we map it to a cpu that is on the same
node. If numa affinity cannot be satisfied, we map it to any unmapped
cpu we can find. Then, map the remaining cpus in the possible cpumap
naively.

Tested-by: Steve Wise 
Signed-off-by: Sagi Grimberg 
---
See thread that started here: 
https://marc.info/?l=linux-rdma&m=153172982318299&w=2

 block/blk-mq-cpumap.c  | 39 +---
 block/blk-mq-rdma.c| 80 +++---
 include/linux/blk-mq.h |  1 +
 3 files changed, 93 insertions(+), 27 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3eb169f15842..34811db8cba9 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -30,30 +30,35 @@ static int get_first_sibling(unsigned int cpu)
return cpu;
 }
 
-int blk_mq_map_queues(struct blk_mq_tag_set *set)
+void blk_mq_map_queue_cpu(struct blk_mq_tag_set *set, unsigned int cpu)
 {
unsigned int *map = set->mq_map;
unsigned int nr_queues = set->nr_hw_queues;
-   unsigned int cpu, first_sibling;
+   unsigned int first_sibling;
 
-   for_each_possible_cpu(cpu) {
-   /*
-* First do sequential mapping between CPUs and queues.
-* In case we still have CPUs to map, and we have some number of
-	 * threads per cores then map sibling threads to the same queue for
-	 * performace optimizations.
-*/
-   if (cpu < nr_queues) {
+   /*
+* First do sequential mapping between CPUs and queues.
+* In case we still have CPUs to map, and we have some number of
+* threads per cores then map sibling threads to the same queue for
+* performace optimizations.
+*/
+   if (cpu < nr_queues) {
+   map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+   } else {
+   first_sibling = get_first_sibling(cpu);
+   if (first_sibling == cpu)
map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-   } else {
-   first_sibling = get_first_sibling(cpu);
-   if (first_sibling == cpu)
-   map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-   else
-   map[cpu] = map[first_sibling];
-   }
+   else
+   map[cpu] = map[first_sibling];
}
+}
+
+int blk_mq_map_queues(struct blk_mq_tag_set *set)
+{
+   unsigned int cpu;
 
+   for_each_possible_cpu(cpu)
+   blk_mq_map_queue_cpu(set, cpu);
return 0;
 }
 EXPORT_SYMBOL_GPL(blk_mq_map_queues);
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..d04cbb1925f5 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -14,6 +14,61 @@
 #include 
 #include 
 
+static int blk_mq_rdma_map_queue(struct blk_mq_tag_set *set,
+   struct ib_device *dev, int first_vec, unsigned int queue)
+{
+   const struct cpumask *mask;
+   unsigned int cpu;
+   bool mapped = false;
+
+   mask = ib_get_vector_affinity(dev, first_vec + queue);
+   if (!mask)
+   return -ENOTSUPP;
+
+   /* map with an unmapped cpu according to affinity mask */
+   for_each_cpu(cpu, mask) {
+   if (set->mq_map[cpu] == UINT_MAX) {
+   set->mq_map[cpu] = queue;
+   mapped = true;
+   break;
+   }
+   }
+
+   if (!mapped) {
+   int n;
+
+   /* map with an unmapped cpu in the same numa node */
+   for_each_node(n) {
+   const struct cpumask *node_cpumask = cpumask_of_node(n);
+
+   if (!cpumask_intersects(mask, node_cpumask))
+   continue;
+
+   for_each_cpu(cpu, node_cpumask) {
+   if (set->mq_map[cpu] == UINT_MAX) {
+   set->mq_map[cpu] = queue;
+   mapped = true;
+   break;
+   }
+   }
+   }
+   }
+
+   if (!mapped) {
+   /* map with any unmapped cpu we can find */
+   for_each_possible_cpu(cpu) {
+   if (set->mq_map[cpu] == UINT_MAX) {
+   

Re: [PATCH v2 2/2] nvme: Fix race conditions related to creation of /dev/nvme0n

2018-08-17 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH] block: Fix cloning of requests with a special payload

2018-06-28 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH] block: pass failfast and driver-specific flags to flush requests

2018-06-07 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCHv2] block: always set partition number to '0' in blk_partition_remap()

2018-06-07 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 12/12] nvme: limit warnings from nvme_identify_ns

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 11/12] nvme: use the changed namespaces list log to clear ns data changed AENs

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 10/12] nvme: mark nvme_queue_scan static

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 08/12] nvmet: mask pending AERs

2018-05-30 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 09/12] nvme: submit AEN event configuration on startup

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 07/12] nvmet: Add AEN configuration support

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 06/12] nvmet: implement the changed namespaces log

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 05/12] nvmet: split log page implementation

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 04/12] nvmet: add a new nvmet_zero_sgl helper

2018-05-30 Thread Sagi Grimberg




+u16 nvmet_zero_sgl(struct nvmet_req *req, off_t off, size_t len)


Would personally prefer nvmet_clear_sgl, but not hard headed on it,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 03/12] nvme.h: add AER configuration symbols

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 02/12] nvme.h: add the changed namespace list log

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 01/12] nvme.h: untangle AEN notice definitions

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-11 Thread Sagi Grimberg



 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	enum mq_rq_state old_state = blk_mq_rq_state(rq);
 
 	blk_mq_put_driver_tag(rq);
 
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, &rq->issue_stat);
 
-	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
-		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+	if (old_state != MQ_RQ_IDLE) {
+		if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
+			WARN_ON_ONCE(true);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}


Can you explain why old_state was kept as a local variable?


Hello Sagi,

Since blk_mq_requeue_request() must be called after a request has completed,
the timeout handler will ignore requests that are being requeued. Hence it
is safe in this function to cache the request state in a local variable.


I understand why it is safe, I just didn't understand why it is needed.


+static inline bool blk_mq_change_rq_state(struct request *rq,
+					  enum mq_rq_state old_state,
+					  enum mq_rq_state new_state)
 {
-	u64 old_val = READ_ONCE(rq->gstate);
-	u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
-	if (state == MQ_RQ_IN_FLIGHT) {
-		WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
-		new_val += MQ_RQ_GEN_INC;
-	}
+	unsigned long old_val = (READ_ONCE(rq->__deadline) & ~RQ_STATE_MASK) |
+				old_state;
+	unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state;
 
-	/* avoid exposing interim values */
-	WRITE_ONCE(rq->gstate, new_val);
+	return cmpxchg(&rq->__deadline, old_val, new_val) == old_val;
 }


Can you explain why this takes the old_state of the request?


Can you clarify your question? The purpose of this function is to change
the request state only into @new_state if it matches @old_state. I think
that is also what the implementation of this function does.


I misread the documentation of this. never mind. thanks.
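
To spell out why the helper takes the caller's view of the old state, here is
a userspace toy (not kernel code) that packs a 2-bit state into the low bits
of the deadline word, mirroring the layout above; the second transition fails
because the state already moved under the stale old_state:
--
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define RQ_STATE_BITS	2
#define RQ_STATE_MASK	((1UL << RQ_STATE_BITS) - 1)

enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

/* deadline and state share one word: low 2 bits = state, rest = deadline */
static _Atomic unsigned long rq_deadline;

static bool change_rq_state(enum mq_rq_state old_state, enum mq_rq_state new_state)
{
	unsigned long old_val = (atomic_load(&rq_deadline) & ~RQ_STATE_MASK) | old_state;
	unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state;

	/* fails if someone else moved the state (or deadline) under us */
	return atomic_compare_exchange_strong(&rq_deadline, &old_val, new_val);
}

int main(void)
{
	atomic_store(&rq_deadline, (1000UL << RQ_STATE_BITS) | MQ_RQ_IN_FLIGHT);

	printf("IN_FLIGHT -> IDLE: %d\n", change_rq_state(MQ_RQ_IN_FLIGHT, MQ_RQ_IDLE));
	/* second attempt uses a stale old_state and therefore loses the race */
	printf("IN_FLIGHT -> COMPLETE: %d\n", change_rq_state(MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE));
	return 0;
}
--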


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread Sagi Grimberg



Ah, yeah, I was moving it out of add_timer but forgot to actully add
it to the issue path.  Fixed patch below.

BTW, no matter what we do w/ the request handover between normal and
timeout paths, we'd need something similar.  Otherwise, while we can
reduce the window, we can't get rid of it.

Thanks.


Just noticed this one, this looks interesting to me as well. Israel,
can you run your test with this patch?



---
  block/blk-mq.c |   45 -
  include/linux/blkdev.h |2 ++
  2 files changed, 34 insertions(+), 13 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
hctx_lock(hctx, &srcu_idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else
+   rq->missed_completion = true;
hctx_unlock(hctx, srcu_idx);
  }
  EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
  
  	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);

blk_add_timer(rq);
+   rq->missed_completion = false;
  
write_seqcount_end(&rq->gstate_seq);

preempt_enable();
@@ -881,7 +884,7 @@ static void blk_mq_check_expired(struct
}
  }
  
-static void blk_mq_timeout_sync_rcu(struct request_queue *q)

+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
  {
struct blk_mq_hw_ctx *hctx;
bool has_rcu = false;
@@ -896,7 +899,8 @@ static void blk_mq_timeout_sync_rcu(stru
else
synchronize_srcu(hctx->srcu);
  
-		hctx->need_sync_rcu = false;

+   if (clear)
+   hctx->need_sync_rcu = false;
}
if (has_rcu)
synchronize_rcu();
@@ -917,25 +921,37 @@ static void blk_mq_terminate_expired(str
blk_mq_rq_timed_out(hctx, rq, priv, reserved);
  }
  
-static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,

+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
  {
/*
 * @rq's timer reset has gone through rcu synchronization and is
 * visible now.  Allow normal completions again by resetting
 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
-* there's no memory ordering around ->aborted_gstate making it the
-* only field safe to update.  Let blk_add_timer() clear it later
-* when the request is recycled or times out again.
-*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored completions
-* and further spurious timeouts.
+* blk_mq_timeout_reset_cleanup() needs it again and there's no
+* memory ordering around ->aborted_gstate making it the only field
+* safe to update.  Let blk_add_timer() clear it later when the
+* request is recycled or times out again.
 */
if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
blk_mq_rq_update_aborted_gstate(rq, 0);
  }
  
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,

+   struct request *rq, void *priv, bool reserved)
+{
+   /*
+* @rq is now fully returned to the normal path.  If normal
+* completion raced timeout handling, execute the missed completion
+* here.  This is safe because 1. ->missed_completion can no longer
+* be asserted because nothing is timing out right now and 2. if
+* ->missed_completion is set, @rq is safe from recycling because
+* nobody could have completed it.
+*/
+   if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
+   blk_mq_complete_request(rq);
+}
+
  static void blk_mq_timeout_work(struct work_struct *work)
  {
struct request_queue *q =
@@ -976,7 +992,7 @@ static void blk_mq_timeout_work(struct w
 * becomes a problem, we can add per-hw_ctx rcu_head and
 * wait in parallel.
 */
-   blk_mq_timeout_sync_rcu(q);
+   blk_mq_timeout_sync_rcu(q, true);
  
  		/* terminate the ones we won */

blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
@@ -988,9 +1004,12 @@ static void blk_mq_timeout_work(struct w
 * reset racing against recycling.
 */
if (nr_resets) {
-   blk_mq_timeout_sync_rcu(q);
+   blk_mq_timeout_sync_rcu(q, false);
+   blk_mq_queue_tag_busy_iter(q,
+   blk_mq_timeout_reset_return, NULL);
+   blk_mq_timeout_sync_rcu(q, true);
blk_mq_queue_tag_busy_iter(q,
-   blk_mq_finish_timeout_reset, NULL);
+   

Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-11 Thread Sagi Grimberg



diff --git a/block/blk-mq.c b/block/blk-mq.c
index 75336848f7a7..81ced3096433 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 		return ERR_PTR(-EXDEV);
 	}
 	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+	if (cpu >= nr_cpu_ids) {
+		pr_warn("no online cpu for hctx %d\n", hctx_idx);
+		cpu = cpumask_first(alloc_data.hctx->cpumask);
+	}


We may do it this way for the special case, but it is ugly, IMO.


Christoph?


Re: [PATCH] blk-mq: Revert "blk-mq: reimplement blk_mq_hw_queue_mapped"

2018-04-11 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-11 Thread Sagi Grimberg



 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	enum mq_rq_state old_state = blk_mq_rq_state(rq);
 
 	blk_mq_put_driver_tag(rq);
 
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, &rq->issue_stat);
 
-	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
-		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+	if (old_state != MQ_RQ_IDLE) {
+		if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
+			WARN_ON_ONCE(true);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}


Can you explain why old_state was kept as a local variable?


+static inline bool blk_mq_change_rq_state(struct request *rq,
+					  enum mq_rq_state old_state,
+					  enum mq_rq_state new_state)
 {
-	u64 old_val = READ_ONCE(rq->gstate);
-	u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
-	if (state == MQ_RQ_IN_FLIGHT) {
-		WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
-		new_val += MQ_RQ_GEN_INC;
-	}
+	unsigned long old_val = (READ_ONCE(rq->__deadline) & ~RQ_STATE_MASK) |
+				old_state;
+	unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state;
 
-	/* avoid exposing interim values */
-	WRITE_ONCE(rq->gstate, new_val);
+	return cmpxchg(&rq->__deadline, old_val, new_val) == old_val;
 }


Can you explain why this takes the old_state of the request?

Otherwise this looks good to me,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-11 Thread Sagi Grimberg



The blk-mq timeout handling code ignores completions that occur after
blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
has reset rq->aborted_gstate. If a block driver timeout handler always
returns BLK_EH_RESET_TIMER then the result will be that the request
never terminates.


Under this situation:

IMO, if this request has been handled by the driver's irq handler, and if
the driver's .timeout still returns BLK_EH_RESET_TIMER, it is a driver bug,
and the correct return value should be BLK_EH_HANDLED.


Is it possible to guarantee this efficiently if the irq handler
can run concurrently with the timeout handler?


Re: [PATCH 8/8] blk-mq: remove code for dealing with remapping queue

2018-04-09 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 7/8] blk-mq: reimplement blk_mq_hw_queue_mapped

2018-04-09 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 6/8] blk-mq: don't check queue mapped in __blk_mq_delay_run_hw_queue()

2018-04-09 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 5/8] blk-mq: remove blk_mq_delay_queue()

2018-04-09 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 4/8] blk-mq: introduce blk_mq_hw_queue_first_cpu() to figure out first cpu

2018-04-09 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 3/8] blk-mq: avoid to write intermediate result to hctx->next_cpu

2018-04-09 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 2/8] blk-mq: don't keep offline CPUs mapped to hctx 0

2018-04-09 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH 1/8] blk-mq: make sure that correct hctx->next_cpu is set

2018-04-09 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-09 Thread Sagi Grimberg



Hi Sagi
Sorry for the late response, bellow patch works, here is the full log:


Thanks for testing!

Now that we isolated the issue, the question is if this fix is correct
given that we are guaranteed that the connect context will run on an
online cpu?

another reference to the patch (we can make the pr_warn a pr_debug):
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 75336848f7a7..81ced3096433 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 		return ERR_PTR(-EXDEV);
 	}
 	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+	if (cpu >= nr_cpu_ids) {
+		pr_warn("no online cpu for hctx %d\n", hctx_idx);
+		cpu = cpumask_first(alloc_data.hctx->cpumask);
+	}
 	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
 
 	rq = blk_mq_get_request(q, NULL, op, &alloc_data);
--


Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code

2018-04-09 Thread Sagi Grimberg



If a completion occurs after blk_mq_rq_timed_out() has reset
rq->aborted_gstate and the request is again in flight when the timeout
expires then a request will be completed twice: a first time by the
timeout handler and a second time when the regular completion occurs.

Additionally, the blk-mq timeout handling code ignores completions that
occur after blk_mq_check_expired() has been called and before
blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
timeout handler always returns BLK_EH_RESET_TIMER then the result will
be that the request never terminates.


OK, now I understand how we can complete twice. Israel, can you verify
this patch solves your double completion problem?

Assuming it does, the changelog of your patches should reference
the original bug report they solve.

Thread starts here:
http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html


Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-09 Thread Sagi Grimberg



My device exposes nr_hw_queues which is not higher than num_online_cpus
so I want to connect all hctxs in the hope that they will be used.


The issue is that CPU online & offline can happen any time, and after
blk-mq removes CPU hotplug handler, there is no way to remap queue
when CPU topo is changed.

For example:

1) after nr_hw_queues is set as num_online_cpus() and hw queues
are initialized, then some of CPUs become offline, and the issue
reported by Zhang Yi is triggered, but in this case, we should fail
the allocation since 1:1 mapping doesn't need to use this inactive
hw queue.


Normal cpu offlining is fine, as the hctxs are already connected. When
we reset the controller and re-establish the queues, the issue triggers
because we call blk_mq_alloc_request_hctx.

The question is, for this particular issue, given that the request
execution is guaranteed to run from an online cpu, will the below work?
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 75336848f7a7..81ced3096433 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 		return ERR_PTR(-EXDEV);
 	}
 	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+	if (cpu >= nr_cpu_ids) {
+		pr_warn("no online cpu for hctx %d\n", hctx_idx);
+		cpu = cpumask_first(alloc_data.hctx->cpumask);
+	}
 	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
 
 	rq = blk_mq_get_request(q, NULL, op, &alloc_data);
--


2) when nr_hw_queues is set as num_online_cpus(), there may be
many fewer online CPUs, so the hw queue number can be initialized as
much smaller, and then performance is degraded even if some CPUs
become online later.


That is correct, when the controller will be reset though, more queues
will be added to the system. I agree it would be good if we can change
stuff dynamically.


So the current policy is to map all possible CPUs for handing CPU
hotplug, and if you want to get 1:1 mapping between hw queue and
online CPU, the nr_hw_queues can be set as num_possible_cpus.


Having nr_hw_queues == num_possible_cpus cannot work as it requires
establishing an RDMA queue-pair with a set of HW resources both on
the host side _and_ on the controller side, which are idle.


Please see commit 16ccfff28976130 (nvme: pci: pass max vectors as
num_possible_cpus() to pci_alloc_irq_vectors).


Yes, I am aware of this patch, however I'm not sure it'll be a good idea
for nvmf as it takes resources from both the host and the target
for cpus that may never come online...


It will waste some memory resource just like percpu variable, but it
simplifies the queue mapping logic a lot, and can support both hard
and soft CPU online/offline without CPU hotplug handler, which may
cause very complicated queue dependency issue.


Yes, but these some memory resources are becoming an issue when it
takes HW (RDMA) resources on the local device and on the target device.


I agree we don't want to connect hctx which doesn't have an online
cpu, that's redundant, but this is not the case here.


OK, I will explain below, and it can be fixed by the following patch too:

https://marc.info/?l=linux-block&m=152318093725257&w=2



I agree this patch is good!


Or I may understand you wrong, :-)


In the report we connected 40 hctxs (which was exactly the number of
online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs.
I'm not sure why some hctxs are left without any online cpus.


That is possible after the following two commits:

4b855ad37194 ("blk-mq: Create hctx for each present CPU)
20e4d8139319 (blk-mq: simplify queue mapping & schedule with each possisble CPU)

And this can be triggered even without putting down any CPUs.

The blk-mq CPU hotplug handler is removed in 4b855ad37194, and we can't
remap queue any more when CPU topo is changed, so the static & fixed mapping
has to be setup from the beginning.

Then if there are fewer online CPUs than hw queues,
some of the hctxs can be mapped with only offline CPUs. For example, if one device
has 4 hw queues, but there are only 2 online CPUs and 6 offline CPUs, at most
2 hw queues are assigned to online CPUs, and the other two are left with only offline
CPUs.


That is fine, but the problem is the example I gave below, which has
nr_hw_queues == num_online_cpus and yet, because of the mapping, we still
have unmapped hctxs.


For FC's case, there may be some hctxs not 'mapped', which is caused by
blk_mq_map_queues(), but that should be one bug.

So the patch(blk-mq: don't keep offline CPUs mapped to hctx 0)[1] is
fixing the issue:

[1] https://marc.info/?l=linux-block&m=152318093725257&w=2

Once this patch is in, any hctx should be mapped by at least one CPU.


I think this will solve the problem Yi is stepping on.


Then later, the patch(blk-mq: reimplement blk_mq_hw_queue_mapped)[2]
extends the mapping concept, maybe it should have been renamed 

Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-08 Thread Sagi Grimberg



On 04/08/2018 03:57 PM, Ming Lei wrote:

On Sun, Apr 08, 2018 at 02:53:03PM +0300, Sagi Grimberg wrote:



Hi Sagi

Still can reproduce this issue with the change:


Thanks for validating Yi,

Would it be possible to test the following:
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 75336848f7a7..81ced3096433 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 		return ERR_PTR(-EXDEV);
 	}
 	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+	if (cpu >= nr_cpu_ids) {
+		pr_warn("no online cpu for hctx %d\n", hctx_idx);
+		cpu = cpumask_first(alloc_data.hctx->cpumask);
+	}
 	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
 
 	rq = blk_mq_get_request(q, NULL, op, &alloc_data);
--
...



[  153.384977] BUG: unable to handle kernel paging request at
3a9ed053bd48
[  153.393197] IP: blk_mq_get_request+0x23e/0x390


Also would it be possible to provide gdb output of:

l *(blk_mq_get_request+0x23e)


nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate
request from one specific hw queue, but there may not be all online CPUs
mapped to this hw queue.


Yes, this is what I suspect..


And the following patchset may fail this kind of allocation and avoid
the kernel oops.

https://marc.info/?l=linux-block&m=152318091025252&w=2


Thanks Ming,

But I don't want to fail the allocation; nvmf_connect_io_queue simply
needs a tag to issue the connect request. I'd much rather take this
tag from an online cpu than fail it... We use this because we reserve


The failure is only triggered when there isn't any online CPU mapped to
this hctx, so do you want to wait for CPUs for this hctx becoming online?


I was thinking of allocating a tag from that hctx even if it had no
online cpu, the execution is done on an online cpu (hence the call
to blk_mq_alloc_request_hctx).
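
(For reference, the call path behind this is roughly the following; a sketch
of the fabrics connect helper of that era, abbreviated, showing why a specific
hctx is requested even though the work runs on whatever CPU executes it:)
--
static int nvmf_connect_io_queue_sketch(struct nvme_ctrl *ctrl, u16 qid)
{
	struct nvme_command cmd = { };
	struct request *rq;

	cmd.connect.opcode = nvme_fabrics_command;
	cmd.connect.fctype = nvme_fabrics_type_connect;
	cmd.connect.qid = cpu_to_le16(qid);

	/* nvme_alloc_request(..., qid) ends up in
	 * blk_mq_alloc_request_hctx(q, op, flags, qid - 1):
	 * the connect for I/O queue qid needs a (reserved) tag
	 * from that queue's own hctx */
	rq = nvme_alloc_request(ctrl->connect_q, &cmd, BLK_MQ_REQ_RESERVED, qid);
	if (IS_ERR(rq))
		return PTR_ERR(rq);

	/* ...set up the connect data buffer and blk_execute_rq() it... */
	blk_mq_free_request(rq);
	return 0;
}
--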


That can be done, but not following the current blk-mq's rule, because
blk-mq requires to dispatch the request on CPUs mapping to this hctx.

Could you explain a bit why you want to do it this way?


My device exposes nr_hw_queues which is not higher than num_online_cpus
so I want to connect all hctxs in the hope that they will be used.

I agree we don't want to connect hctx which doesn't have an online
cpu, that's redundant, but this is not the case here.


Or I may understand you wrong, :-)


In the report we connected 40 hctxs (which was exactly the number of
online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs.
I'm not sure why some hctxs are left without any online cpus.


That is possible after the following two commits:

4b855ad37194 ("blk-mq: Create hctx for each present CPU)
20e4d8139319 (blk-mq: simplify queue mapping & schedule with each possisble CPU)

And this can be triggered even without putting down any CPUs.

The blk-mq CPU hotplug handler is removed in 4b855ad37194, and we can't
remap queue any more when CPU topo is changed, so the static & fixed mapping
has to be setup from the beginning.

Then if there are fewer online CPUs than hw queues,
some of the hctxs can be mapped with only offline CPUs. For example, if one device
has 4 hw queues, but there are only 2 online CPUs and 6 offline CPUs, at most
2 hw queues are assigned to online CPUs, and the other two are left with only offline
CPUs.


That is fine, but the problem is the example I gave below, which has
nr_hw_queues == num_online_cpus and yet, because of the mapping, we still
have unmapped hctxs.


Lets say I have 4-cpu system and my device always allocates
num_online_cpus() hctxs.

at first I get:
cpu0 -> hctx0
cpu1 -> hctx1
cpu2 -> hctx2
cpu3 -> hctx3

When cpu1 goes offline I think the new mapping will be:
cpu0 -> hctx0
cpu1 -> hctx0 (from cpu_to_queue_index) // offline
cpu2 -> hctx2
cpu3 -> hctx0 (from cpu_to_queue_index)

This means that now hctx1 is unmapped. I guess we can fix the nvmf code
to not connect it. But we end up with fewer queues than cpus without
any good reason.
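
To make that concrete, here is a deliberately simplified sketch of a static
CPU-to-queue assignment in the spirit of cpu_to_queue_index(). This is not
the exact blk-mq logic, just an illustration of how a fixed formula that
knows nothing about online CPUs can leave an hctx unusable:
--
/*
 * Simplified stand-in for the static mapping; the formula is computed
 * over all possible CPUs and does not know which of them are online.
 */
static unsigned int cpu_to_queue_index(unsigned int nr_queues, unsigned int cpu)
{
	return cpu % nr_queues;
}

/*
 * With 4 queues and cpu1 offline, hctx1 ends up with only the offline
 * cpu1 in its mask, i.e. effectively unmapped for I/O, even though we
 * allocated one hctx per CPU.
 */
--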

Optimally, I would want a different mapping that uses all
the queues:
cpu0 -> hctx0
cpu2 -> hctx1
cpu3 -> hctx2
* cpu1 -> hctx1 (doesn't matter, offline)

Something looks broken...


No, it isn't broken.


maybe broken is the wrong phrase, but it's suboptimal...


Storage is a client/server model; the hw queue should only be active if
there are requests coming from the client (CPU),


Correct.


and the hw queue becomes inactive if no online CPU is mapped to it.


But when we reset the controller, we call blk_mq_update_nr_hw_queues()
with the current nr_hw_queues, which never exceeds num_online_cpus.
This, in turn, remaps the mq_map, which results in unmapped queues
because of the mapping function, not because we have more hctxs than
online cpus...

An easy fix is to allocate num_present_cpus
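
(For illustration only, a minimal sketch of that direction: size the tag set
by present CPUs rather than online CPUs, so a later remap cannot leave hctxs
without a CPU. 'ctrl' and 'max_hw_queues' are generic stand-ins, not code
from a posted patch.)
--
	/* illustrative sketch, not a posted patch */
	ctrl->tag_set.nr_hw_queues = min_t(unsigned int,
					   num_present_cpus(),
					   ctrl->max_hw_queues);
	ret = blk_mq_alloc_tag_set(&ctrl->tag_set);
	if (ret)
		return ret;
--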

Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-08 Thread Sagi Grimberg



Hi Sagi

Still can reproduce this issue with the change:


Thanks for validating Yi,

Would it be possible to test the following:
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 75336848f7a7..81ced3096433 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct
request_queue *q,
  return ERR_PTR(-EXDEV);
  }
  cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+   if (cpu >= nr_cpu_ids) {
+   pr_warn("no online cpu for hctx %d\n", hctx_idx);
+   cpu = cpumask_first(alloc_data.hctx->cpumask);
+   }
  alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

  rq = blk_mq_get_request(q, NULL, op, &alloc_data);
--
...



[  153.384977] BUG: unable to handle kernel paging request at
3a9ed053bd48
[  153.393197] IP: blk_mq_get_request+0x23e/0x390


Also would it be possible to provide gdb output of:

l *(blk_mq_get_request+0x23e)


nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate
request from one specific hw queue, but there may not be all online CPUs
mapped to this hw queue.


Yes, this is what I suspect..


And the following patchset may fail this kind of allocation and avoid
the kernel oops.

https://marc.info/?l=linux-block=152318091025252=2


Thanks Ming,

But I don't want to fail the allocation; nvmf_connect_io_queue simply
needs a tag to issue the connect request, and I'd much rather take this
tag from an online cpu than fail it... We use this because we reserve


The failure is only triggered when there isn't any online CPU mapped to
this hctx, so do you want to wait for CPUs for this hctx becoming online?


I was thinking of allocating a tag from that hctx even if it had no
online cpu, the execution is done on an online cpu (hence the call
to blk_mq_alloc_request_hctx).


Or I may understand you wrong, :-)


In the report we connected 40 hctxs (which was exactly the number of
online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs.
I'm not sure why some hctxs are left without any online cpus.

This seems to be related to the queue mapping.

Lets say I have 4-cpu system and my device always allocates
num_online_cpus() hctxs.

at first I get:
cpu0 -> hctx0
cpu1 -> hctx1
cpu2 -> hctx2
cpu3 -> hctx3

When cpu1 goes offline I think the new mapping will be:
cpu0 -> hctx0
cpu1 -> hctx0 (from cpu_to_queue_index) // offline
cpu2 -> hctx2
cpu3 -> hctx0 (from cpu_to_queue_index)

This means that now hctx1 is unmapped. I guess we can fix the nvmf code
to not connect it. But we end up with fewer queues than cpus without
any good reason.

Optimally, I would want a different mapping that uses all
the queues:
cpu0 -> hctx0
cpu2 -> hctx1
cpu3 -> hctx2
* cpu1 -> hctx1 (doesn't matter, offline)

Something looks broken...


Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-08 Thread Sagi Grimberg



Hi Sagi

Still can reproduce this issue with the change:


Thanks for validating Yi,

Would it be possible to test the following:
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 75336848f7a7..81ced3096433 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct
request_queue *q,
 return ERR_PTR(-EXDEV);
 }
 cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+   if (cpu >= nr_cpu_ids) {
+   pr_warn("no online cpu for hctx %d\n", hctx_idx);
+   cpu = cpumask_first(alloc_data.hctx->cpumask);
+   }
 alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

 rq = blk_mq_get_request(q, NULL, op, &alloc_data);
--
...



[  153.384977] BUG: unable to handle kernel paging request at
3a9ed053bd48
[  153.393197] IP: blk_mq_get_request+0x23e/0x390


Also would it be possible to provide gdb output of:

l *(blk_mq_get_request+0x23e)


nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate
request from one specific hw queue, but there may not be all online CPUs
mapped to this hw queue.


Yes, this is what I suspect..


And the following patchset may fail this kind of allocation and avoid
the kernel oops.

https://marc.info/?l=linux-block=152318091025252=2


Thanks Ming,

But I don't want to fail the allocation; nvmf_connect_io_queue simply
needs a tag to issue the connect request, and I'd much rather take this
tag from an online cpu than fail it... We use this because we reserve
a tag per-queue for this, but in this case, I'd rather block until the
inflight tag completes than fail the connect.


Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-08 Thread Sagi Grimberg



Hi Sagi

Still can reproduce this issue with the change:


Thanks for validating Yi,

Would it be possible to test the following:
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 75336848f7a7..81ced3096433 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct 
request_queue *q,

return ERR_PTR(-EXDEV);
}
cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+   if (cpu >= nr_cpu_ids) {
+   pr_warn("no online cpu for hctx %d\n", hctx_idx);
+   cpu = cpumask_first(alloc_data.hctx->cpumask);
+   }
alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

rq = blk_mq_get_request(q, NULL, op, &alloc_data);
--
...


[  153.384977] BUG: unable to handle kernel paging request at 
3a9ed053bd48

[  153.393197] IP: blk_mq_get_request+0x23e/0x390


Also would it be possible to provide gdb output of:

l *(blk_mq_get_request+0x23e)

Thanks,


Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-04 Thread Sagi Grimberg



On 03/30/2018 12:32 PM, Yi Zhang wrote:

Hello
I got this kernel BUG on 4.16.0-rc7, here is the reproducer and log, let me 
know if you need more info, thanks.

Reproducer:
1. setup target
#nvmetcli restore /etc/rdma.json
2. connect target on host
#nvme connect-all -t rdma -a $IP -s 4420
3. do fio background on host
#fio -filename=/dev/nvme0n1 -iodepth=1 -thread -rw=randwrite -ioengine=psync 
-bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 
-bs_unaligned -runtime=180 -size=-group_reporting -name=mytest -numjobs=60 &
4. offline cpu on host
#echo 0 > /sys/devices/system/cpu/cpu1/online
#echo 0 > /sys/devices/system/cpu/cpu2/online
#echo 0 > /sys/devices/system/cpu/cpu3/online
5. clear target
#nvmetcli clear
6. restore target
#nvmetcli restore /etc/rdma.json
7. check console log on host


Hi Yi,

Does this happen with this applied?
--
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..b89da55e8aaa 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -35,6 +35,8 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
const struct cpumask *mask;
unsigned int queue, cpu;

+   goto fallback;
+
for (queue = 0; queue < set->nr_hw_queues; queue++) {
mask = ib_get_vector_affinity(dev, first_vec + queue);
if (!mask)
--


Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-03-05 Thread Sagi Grimberg



-   if (nvmeq->sq_cmds_io)
-   memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
-   else
-   memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
+   memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));


Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
the _toio() variant enforces alignment, does the copy with 4 byte
stores, and has a full barrier after the copy. In comparison our
regular memcpy() does none of those things and may use unaligned and
vector load/stores. For normal (cacheable) memory that is perfectly
fine, but they can cause alignment faults when targeted at MMIO
(cache-inhibited) memory.

I think in this particular case it might be ok since we know SQEs are
aligned to 64 byte boundaries and the copy is too small to use our
vectorised memcpy(). I'll assume we don't need explicit ordering
between writes of SQEs since the existing code doesn't seem to care
unless the doorbell is being rung, so you're probably fine there too.
That said, I still think this is a little bit sketchy and at the very
least you should add a comment explaining what's going on when the CMB
is being used. If someone more familiar with the NVMe driver could
chime in I would appreciate it.


I may not be understanding the concern, but I'll give it a shot.

You're right, the start of any SQE is always 64-byte aligned, so that
should satisfy alignment requirements.

The order when writing multiple/successive SQEs in a submission queue
does matter, and this is currently serialized through the q_lock.

The order in which the bytes of a single SQE is written doesn't really
matter as long as the entire SQE is written into the CMB prior to writing
that SQ's doorbell register.

The doorbell register is written immediately after copying a command
entry into the submission queue (ignore "shadow buffer" features),
so the doorbells written to commands submitted is 1:1.

If a CMB SQE and DB order is not enforced with the memcpy, then we do
need a barrier after the SQE's memcpy and before the doorbell's writel.


Keith, while we're on this, regardless of cmb, is SQE memcopy and DB 
update ordering always guaranteed?


If you look at mlx4 (rdma device driver) that works exactly the same as
nvme you will find:
--
qp->sq.head += nreq;

/*
 * Make sure that descriptors are written before
 * doorbell record.
 */
wmb();

writel(qp->doorbell_qpn,
   to_mdev(ibqp->device)->uar_map + 
MLX4_SEND_DOORBELL);


/*
 * Make sure doorbells don't leak out of SQ spinlock
 * and reach the HCA out of order.
 */
mmiowb();
--

That seems to explicitly make sure to place a barrier before updating
the doorbell. So as I see it, either ordering is guaranteed and the
above code is redundant, or nvme needs to do the same.

Thoughts?
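
To make the question concrete, here is a sketch of what the submission path
would look like with an explicit barrier between the SQE copy (possibly into
the CMB) and the doorbell write, mirroring the mlx4 pattern above. This is an
illustration based on the nvme_queue fields assumed from the driver, not the
actual nvme-pci code:
--
static void nvme_submit_cmd_sketch(struct nvme_queue *nvmeq,
				   struct nvme_command *cmd)
{
	u16 tail = nvmeq->sq_tail;

	/* SQE copy; with a CMB this may target PCI memory. */
	memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));

	if (++tail == nvmeq->q_depth)
		tail = 0;

	/* make sure the SQE is fully visible before ringing the doorbell */
	wmb();

	writel(tail, nvmeq->q_db);
	nvmeq->sq_tail = tail;
}
--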


Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Sagi Grimberg



On 01/03/18 04:03 AM, Sagi Grimberg wrote:

Can you describe what would be the plan to have it when these devices
do come along? I'd say that p2p_dev needs to become a nvmet_ns reference
and not from nvmet_ctrl. Then, when cmb capable devices come along, the
ns can prefer to use its own cmb instead of locating a p2p_dev device?


The patchset already supports CMB drives. That's essentially what patch 
7 is for. We change the nvme-pci driver to use the p2pmem code to 
register and manage the CMB memory. After that, it is simply available 
to the nvmet code. We have already been using this with a couple 
prototype CMB drives.


The comment was to your statement:
"Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of its own memory. However, at this time, cards with CMB buffers
don't seem to be available."

Maybe its a left-over which confused me...

Anyways, my question still holds. If I rack several of these
nvme drives, ideally we would use _their_ cmbs for I/O that is
directed to these namespaces. This is why I was suggesting that
p2p_dev should live in nvmet_ns and not in nvmet_ctrl as a single
p2p_dev used by all namespaces.

nvmet_ns seems like a more natural place to host p2p_dev with
cmb in mind.
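
As a rough sketch of that placement (names are illustrative and not from the
posted series), the p2p resolution would move into namespace enable and the
namespace would carry its own p2p_dev and client list:
--
/* illustrative only: per-namespace p2p provider resolution */
static void nvmet_ns_setup_p2pmem_sketch(struct nvmet_ns *ns)
{
	/* ns->p2p_dev and ns->p2p_clients are hypothetical fields */
	if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
		ns->p2p_dev = NULL;	/* this ns falls back to host memory */
		return;
	}

	if (pci_p2pdma_add_client(&ns->p2p_clients, nvmet_ns_dev(ns)))
		return;

	/* prefer the namespace's own CMB once such devices exist */
	ns->p2p_dev = pci_p2pmem_find(&ns->p2p_clients);
}
--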


+static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
+   struct nvmet_ns *ns)
+{
+    int ret;
+
+    if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
+    pr_err("peer-to-peer DMA is not supported by %s\n",
+   ns->device_path);
+    return -EINVAL;


I'd say that just skip it instead of failing it, in theory one can
connect nvme devices via p2p mem and expose other devices in the
same subsystem. The message would be pr_debug to reduce chattiness.


No, it's actually a bit more complicated than you make it. There's a 
couple cases:


1) The user adds a namespace but there hasn't been a connection and no 
p2pmem has been selected. In this case the add_client function is never 
called and the user can add whatever namespace they like.


2) When the first connect happens, nvmet_setup_p2pmem() will call 
add_client() for each namespace and rdma device. If any of them fail 
then it does not use a P2P device and falls back, as you'd like, to 
regular memory.


3) When the user adds a namespace after a port is in use and a 
compatible P2P device has been found. In this case, if the user tries to 
add a namespace that is not compatible with the P2P device in use then 
it fails adding the new namespace. The only alternative here is to tear 
everything down, ensure there are no P2P enabled buffers open and start 
using regular memory again... That is very difficult.




Wouldn't it all be simpler if the p2p_dev resolution would be private
to the namespace?

So you are saying that all the namespaces in a subsystem must comply to
using p2p? Seems a little bit harsh if it's not absolutely needed. Would
be nice to export a subsystem between two ports (on two HCAs, across
NUMA nodes) where the home node (primary path) would use p2p and
failover would use host memory...

Can you help me understand why this is absolutely not feasible?

I also disagree that these messages should be pr_debug. If a user is 
trying to use P2P memory and they enable it but the system doesn't 
choose to use their memory they must know why that is so they can make 
the necessary adjustments. If the system doesn't at least print a dmesg 
they are in the dark.


I was arguing that the user might have intentionally wanted to use p2p
where possible but still expose other namespaces over host memory. For
this user, the messages are spam. I guess info/warn could also suffice
(assuming we allow it, if we fail it then no point of the debug level
discussion).




+    }
+
+    ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
+    if (ret)
+    pr_err("failed to add peer-to-peer DMA client %s: %d\n",
+   ns->device_path, ret);
+
+    return ret;
+}
+
  int nvmet_ns_enable(struct nvmet_ns *ns)
  {
  struct nvmet_subsys *subsys = ns->subsys;
@@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
  if (ret)
  goto out_blkdev_put;
+    list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+    if (ctrl->p2p_dev) {
+    ret = nvmet_p2pdma_add_client(ctrl, ns);
+    if (ret)
+    goto out_remove_clients;


Is this really a fatal failure given that we fall-back to main
memory? Why not continue with main memory (and warn at best)?


See above. It's fatal because we are already using p2p memory and we 
can't easily tear that all down when a user adds a new namespace.


In my mind, I/O to this namespace would use host memory and be done
with it. I guess I still don't understand why this is not possible.


+    ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);


This is the first p2p_dev found

Re: [PATCH v2 05/10] block: Introduce PCI P2P flags for request and request queue

2018-03-01 Thread Sagi Grimberg

Looks fine,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v2 08/10] nvme-pci: Add support for P2P memory in requests

2018-03-01 Thread Sagi Grimberg



For P2P requests we must use the pci_p2pmem_[un]map_sg() functions
instead of the dma_map_sg functions.

With that, we can then indicate PCI_P2P support in the request queue.
For this, we create an NVME_F_PCI_P2P flag which tells the core to
set QUEUE_FLAG_PCI_P2P in the request queue.


This looks fine to me,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>

Any plans adding the capability to nvme-rdma? Should be
straight-forward... In theory, the use-case would be rdma backend
fabric behind. Shouldn't be hard to test either...


Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory

2018-03-01 Thread Sagi Grimberg



We create a configfs attribute in each nvme-fabrics target port to
enable p2p memory use. When enabled, the port will only then use the
p2p memory if a p2p memory device can be found which is behind the
same switch as the RDMA port and all the block devices in use. If
the user enabled it and no devices are found, then the system will
silently fall back on using regular memory.

If appropriate, that port will allocate memory for the RDMA buffers
for queues from the p2pmem device falling back to system memory should
anything fail.


Nice.


Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of it's own memory. However, at this time, cards with CMB buffers
don't seem to be available.


Can you describe what would be the plan to have it when these devices
do come along? I'd say that p2p_dev needs to become a nvmet_ns reference
and not from nvmet_ctrl. Then, when cmb capable devices come along, the
ns can prefer to use its own cmb instead of locating a p2p_dev device?


+static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
+  struct nvmet_ns *ns)
+{
+   int ret;
+
+   if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
+   pr_err("peer-to-peer DMA is not supported by %s\n",
+  ns->device_path);
+   return -EINVAL;


I'd say that just skip it instead of failing it, in theory one can
connect nvme devices via p2p mem and expose other devices in the
same subsystem. The message would be pr_debug to reduce chattiness.


+   }
+
+   ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
+   if (ret)
+   pr_err("failed to add peer-to-peer DMA client %s: %d\n",
+  ns->device_path, ret);
+
+   return ret;
+}
+
  int nvmet_ns_enable(struct nvmet_ns *ns)
  {
struct nvmet_subsys *subsys = ns->subsys;
@@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
if (ret)
goto out_blkdev_put;
  
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {

+   if (ctrl->p2p_dev) {
+   ret = nvmet_p2pdma_add_client(ctrl, ns);
+   if (ret)
+   goto out_remove_clients;


Is this really a fatal failure given that we fall-back to main
memory? Why not continue with main memory (and warn at best)?


+/*
+ * If allow_p2pmem is set, we will try to use P2P memory for the SGL lists for
+ * I/O commands. This requires the PCI p2p device to be compatible with the
+ * backing device for every namespace on this controller.
+ */
+static void nvmet_setup_p2pmem(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
+{
+   struct nvmet_ns *ns;
+   int ret;
+
+   if (!req->port->allow_p2pmem || !req->p2p_client)
+   return;
+
+   mutex_lock(&ctrl->subsys->lock);

+   ret = pci_p2pdma_add_client(&ctrl->p2p_clients, req->p2p_client);
+   if (ret) {
+   pr_err("failed adding peer-to-peer DMA client %s: %d\n",
+  dev_name(req->p2p_client), ret);
+   goto free_devices;
+   }
+
+   list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces, dev_link) {
+   ret = nvmet_p2pdma_add_client(ctrl, ns);
+   if (ret)
+   goto free_devices;
+   }
+
+   ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);


This is the first p2p_dev found right? What happens if I have more than
a single p2p device? In theory I'd have more p2p memory I can use. Have
you considered making pci_p2pmem_find return the least used suitable
device?


+   if (!ctrl->p2p_dev) {
+   pr_info("no supported peer-to-peer memory devices found\n");
+   goto free_devices;
+   }
+   mutex_unlock(&ctrl->subsys->lock);
+
+   pr_info("using peer-to-peer memory on %s\n", pci_name(ctrl->p2p_dev));
+   return;
+
+free_devices:
+   pci_p2pdma_client_list_free(&ctrl->p2p_clients);
+   mutex_unlock(&ctrl->subsys->lock);
+}
+
+static void nvmet_release_p2pmem(struct nvmet_ctrl *ctrl)
+{
+   if (!ctrl->p2p_dev)
+   return;
+
+   mutex_lock(&ctrl->subsys->lock);

+   pci_p2pdma_client_list_free(&ctrl->p2p_clients);
+   pci_dev_put(ctrl->p2p_dev);
+   ctrl->p2p_dev = NULL;
+
+   mutex_unlock(&ctrl->subsys->lock);
+}
+
  u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
struct nvmet_req *req, u32 kato, struct nvmet_ctrl **ctrlp)
  {
@@ -800,6 +890,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char 
*hostnqn,
  
  	INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);

	INIT_LIST_HEAD(&ctrl->async_events);
+   INIT_LIST_HEAD(&ctrl->p2p_clients);
  
  	memcpy(ctrl->subsysnqn, subsysnqn, NVMF_NQN_SIZE);

memcpy(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE);
@@ -855,6 +946,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char 
*hostnqn,
ctrl->kato = 

Re: [PATCH v2 09/10] nvme-pci: Add a quirk for a pseudo CMB

2018-03-01 Thread Sagi Grimberg

Looks fine,

Reviewed-by: Sagi Grimberg <s...@grimberg.me>


Re: [PATCH v2 06/10] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()

2018-03-01 Thread Sagi Grimberg



On 03/01/2018 01:40 AM, Logan Gunthorpe wrote:

In order to use PCI P2P memory pci_p2pmem_[un]map_sg() functions must be
called to map the correct DMA address. To do this, we add a flags
variable and the RDMA_RW_CTX_FLAG_PCI_P2P flag. When the flag is
specified use the appropriate map function.

Signed-off-by: Logan Gunthorpe 
---
  drivers/infiniband/core/rw.c| 21 +
  drivers/infiniband/ulp/isert/ib_isert.c |  5 +++--
  drivers/infiniband/ulp/srpt/ib_srpt.c   |  7 ---
  drivers/nvme/target/rdma.c  |  6 +++---
  include/rdma/rw.h   |  7 +--
  net/sunrpc/xprtrdma/svc_rdma_rw.c   |  6 +++---
  6 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index c8963e91f92a..775a9f8b15a6 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -12,6 +12,7 @@
   */
  #include 
  #include 
+#include 
  #include 
  #include 
  
@@ -269,18 +270,24 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,

   * @remote_addr:remote address to read/write (relative to @rkey)
   * @rkey: remote key to operate on
   * @dir:  %DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ * @flags:  any of the RDMA_RW_CTX_FLAG_* flags
   *
   * Returns the number of WQEs that will be needed on the workqueue if
   * successful, or a negative error code.
   */
  int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
-   u64 remote_addr, u32 rkey, enum dma_data_direction dir)
+   u64 remote_addr, u32 rkey, enum dma_data_direction dir,
+   unsigned int flags)
  {
struct ib_device *dev = qp->pd->device;
int ret;
  
-	ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);

+   if (flags & RDMA_RW_CTX_FLAG_PCI_P2PDMA)
+   ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
+   else
+   ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+


Why not use is_pci_p2pdma_page(sg) instead of a flag? It would be so
much cleaner...
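
For illustration, the decision could key off the pages themselves rather
than a caller-supplied flag (a sketch; it assumes the SG list is
homogeneous, as in this series):
--
	/* sketch: choose the mapping path from the pages, not a caller flag */
	if (is_pci_p2pdma_page(sg_page(sg)))
		ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
	else
		ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
--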


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Sagi Grimberg



Hi Everyone,


Hi Logan,


Here's v2 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.16-rc3 which already
includes Christoph's devpagemap work the previous version was based
off as well as a couple of the cleanup patches that were in v1.

Additionally, we've made the following changes based on feedback:

* Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
   as a bunch of cleanup and spelling fixes he pointed out in the last
   series.

* To address Alex's ACS concerns, we change to a simpler method of
   just disabling ACS behind switches for any kernel that has
   CONFIG_PCI_P2PDMA.

* We also reject using devices that employ 'dma_virt_ops' which should
   fairly simply handle Jason's concerns that this work might break with
   the HFI, QIB and rxe drivers that use the virtual ops to implement
   their own special DMA operations.


That's good, but what would happen with these devices? Simply fail the
mapping, causing the ulp to fail its rdma operation? I would think
that we need a capability flag for devices that support it.


Re: [PATCH rfc 3/5] irq_poll: wire up irq_am

2018-02-12 Thread Sagi Grimberg

Hey Bart,


+void irq_poll_init_am(struct irq_poll *iop, unsigned int nr_events,
+unsigned short nr_levels, unsigned short start_level, irq_poll_am_fn 
*amfn)
+{
+   iop->amfn = amfn;
+   irq_am_init(&iop->am, nr_events, nr_levels, start_level, irq_poll_am);
+}


This function has a large number of parameters and most of them are passed
verbatim to irq_am_init(). Please consider introducing a structure for these
parameters such that the number of function arguments stays reasonable.


I can definitely change that.


Re: [PATCH] blk: optimization for classic polling

2018-02-08 Thread Sagi Grimberg



I think it'd be simpler to have blk_poll set it back to running if
need_resched is true rather than repeat this patter across all the
callers:

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102e2149..40285fe1c8ad 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3164,6 +3164,7 @@ static bool __blk_mq_poll(struct blk_mq_hw_ctx *hctx, 
struct request *rq)
cpu_relax();
}
  
+	set_current_state(TASK_RUNNING);

return false;
  }
  
--


Nice!


Re: [PATCH rfc 0/5] generic adaptive IRQ moderation library for I/O devices

2018-02-06 Thread Sagi Grimberg

Hi Tal,


I think Tal has idea/s on how the existing library can be changed to
support more modes/models

What I was thinking is allowing the DIM algorithm to disregard data which is
0. Currently, if bytes == 0 we return "SAME" immediately. We can change
it to simply move on to the packets check (which may be renamed to
"completions"). This way you could use DIM while only optimizing for (P1)
high packet rate and (P2) low interrupt rate.


That was exactly where I started from. But unfortunately it did not work
well :(

From my experiments, the moderation was all over the place failing to
converge. At least the workloads that I've tested with, it was more
successful to have a stricter step policy and pulling towards latency
if we are consistently catching single completion per event.

I'm not an expert here at all, but at this point, based on my attempts
so far, I'm not convinced the current net_dim scheme could work.


Re: [PATCH rfc 0/5] generic adaptive IRQ moderation library for I/O devices

2018-02-06 Thread Sagi Grimberg

Hey Or,


not any more, Andy and Tal made the mlx5 AM code a kernel library
which is called DIM

f4e5f0e MAINTAINERS: add entry for Dynamic Interrupt Moderation
6a8788f bnxt_en: add support for software dynamic interrupt moderation
8115b75 net/dim: use struct net_dim_sample as arg to net_dim
4c4dbb4 net/mlx5e: Move dynamic interrupt coalescing code to include/linux
9a31742 net/mlx5e: Change Mellanox references in DIM code
b9c872f net/mlx5e: Move generic functions to new file
f5e7f67 net/mlx5e: Move AM logic enums
138968e net/mlx5e: Remove rq references in mlx5e_rx_am
f58ee09 net/mlx5e: Move interrupt moderation forward declarations
98dd1ed net/mlx5e: Move interrupt moderation structs to new file

can you make use of that? cc-ing them in case you have questions/comments


Thanks a lot for the pointer, I'm not following net-next regularly.
This seems to be a copy of the mlx5 code.

So as mentioned in the cover letter, the implementation was inspired
by the mlx5 driver, as it had some of the abstraction properties I was
aiming to achieve.

What I didn't mention was that I started by using the mlx5
implementation as is (slightly modified). Unfortunately, in the
workloads I've tested, I was not able to get the am level to
converge. So I went in a slightly different direction, which gave me
better results (although still not perfect and lightly tested).

This led me to believe that the strategy mlx5 offers might not suit
storage I/O workloads (although I acknowledge that I could be proven
wrong here).

For example, a design choice I took was that the moderation scheme would
be more aggressive towards latency at the expense of throughput
optimization, since high-end storage devices are often expected to meet
strict latency requirements. Does that make sense for network devices?
I don't know.

Another difference is that unlike net devices, storage I/O transactions
usually look like request/response where the data transfer is offloaded
by the controller/hba (net devices moderate on raw RX datagrams).
Moreover, it might not be trivial to track the bytes/sec from a
completion at all (might be embedded somewhere in the consumer context).

So overall, at this point I'm a bit skeptical that it will make sense
to commonise the two implementations. I'd first like to hear some more
input on whether this is something the community is interested in. If it
is, we can look at whether it makes sense to reuse net_dim.h or not.


[PATCH rfc 1/5] irq-am: Introduce library implementing generic adaptive moderation

2018-02-05 Thread Sagi Grimberg
irq-am library helps I/O devices implement interrupt moderation in
an adaptive fashion, based on online stats.

The consumer can initialize an irq-am context with a callback that
performs the device-specific moderation programming, and also with the
number of am (adaptive moderation) levels, which are likewise abstracted
to allow for device-specific tuning.

The irq-am code will sample once every nr_events and will check for a
significant change in workload characteristics (completions per second,
events per second) and, if it detects one, will perform an am level update
(called a step).

The irq-am code assumes that the am levels are sorted in increasing order,
where the lowest level corresponds to the optimum latency tuning (short time
and low completion-count), gradually increasing towards the optimum
throughput tuning (longer time and higher completion-count). So there is a
trend and tuning direction tracked by the moderator. When the moderator
collects sufficient statistics (also controlled by the consumer defining
nr_events), it compares the current stats with the previous stats, and if a
significant change was observed in the load, the moderator attempts to
increment/decrement its current level (step) and schedules a moderation
program dispatch work.

Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 include/linux/irq-am.h | 116 +++
 lib/Kconfig|   5 ++
 lib/Makefile   |   1 +
 lib/irq-am.c   | 182 +
 4 files changed, 304 insertions(+)
 create mode 100644 include/linux/irq-am.h
 create mode 100644 lib/irq-am.c

diff --git a/include/linux/irq-am.h b/include/linux/irq-am.h
new file mode 100644
index ..5ddd5ca268aa
--- /dev/null
+++ b/include/linux/irq-am.h
@@ -0,0 +1,116 @@
+/*
+ * Adaptive moderation support for I/O devices.
+ * Copyright (c) 2018 Lightbits Labs.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#ifndef _IRQ_AM_H
+#define _IRQ_AM_H
+
+#include 
+#include 
+
+struct irq_am;
+typedef int (irq_am_fn)(struct irq_am *, unsigned short level);
+
+/*
+ * struct irq_am_sample_stats - sample stats for adaptive moderation
+ * @cps:	completions per-second
+ * @eps:	events per-second
+ * @cpe:	completions per event
+ */
+struct irq_am_sample_stats {
+   u32 cps;
+   u32 eps;
+   u32 cpe;
+};
+
+/*
+ * struct irq_am_sample - per-irq interrupt batch sample unit
+ * @time: current time
+ * @comps: completions count since last sample
+ * @events:events count since the last sample
+ */
+struct irq_am_sample {
+   ktime_t time;
+   u64 comps;
+   u64 events;
+};
+
+/*
+ * enum irq_am_state - adaptive moderation monitor states
+ * @IRQ_AM_START_MEASURING:collect first sample (start_sample)
+ * @IRQ_AM_MEASURING:  measurement in progress
+ * @IRQ_AM_PROGRAM_MODERATION: moderation program scheduled
+ * so we should not react to any stats
+ * from the old moderation profile.
+ */
+enum irq_am_state {
+   IRQ_AM_START_MEASURING,
+   IRQ_AM_MEASURING,
+   IRQ_AM_PROGRAM_MODERATION,
+};
+
+enum irq_am_tune_state {
+   IRQ_AM_GOING_UP,
+   IRQ_AM_GOING_DOWN,
+};
+
+enum irq_am_relative_diff {
+   IRQ_AM_STATS_WORSE,
+   IRQ_AM_STATS_SAME,
+   IRQ_AM_STATS_BETTER,
+};
+
+struct irq_am_stats {
+   u64 events;
+   u64 comps;
+};
+
+/*
+ * struct irq_am - irq adaptive moderation monitor
+ * @state: adaptive moderation monitor state
+ * @tune_state:tuning state of the moderation monitor
+ * @am_stats:  overall completions and events counters
+ * @start_sample:  first sample in moderation batch
+ * @prev_stats:previous stats for trend detection
+ * @nr_events: number of events between samples
+ * @nr_levels: number of moderation levels
+ * @curr_level:current moderation level
+ * @work:  schedule moderation program
+ * @program:   moderation program handler
+ */
+struct irq_am {
+   enum irq_am_state   state;
+   enum irq_am_tune_state  tune_state;
+
+   struct irq_am_stats am_stats;
+   struct irq_am_sample	start_sample;
+   struct irq_am_sample_stats  prev_stats;
+
+   u16 nr_events;
+   unsigned short  nr_levels;
+   unsigned short  curr_level;
+
+   struct work_

[PATCH rfc 2/5] irq-am: add some debugfs exposure on tuning state

2018-02-05 Thread Sagi Grimberg
Useful for local debugging

Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 include/linux/irq-am.h |   2 +
 lib/irq-am.c   | 109 +
 2 files changed, 111 insertions(+)

diff --git a/include/linux/irq-am.h b/include/linux/irq-am.h
index 5ddd5ca268aa..18df315633a4 100644
--- a/include/linux/irq-am.h
+++ b/include/linux/irq-am.h
@@ -101,6 +101,8 @@ struct irq_am {
 
struct work_struct  work;
irq_am_fn   *program;
+   struct dentry   *debugfs_dir;
+   int id;
 };
 
 void irq_am_add_event(struct irq_am *am);
diff --git a/lib/irq-am.c b/lib/irq-am.c
index ed7befd7a560..6cd2f131f5e8 100644
--- a/lib/irq-am.c
+++ b/lib/irq-am.c
@@ -12,6 +12,61 @@
  * more details.
  */
 #include 
+#include 
+
+static DEFINE_IDA(am_ida);
+
+#ifdef CONFIG_DEBUG_FS
+struct dentry *irq_am_debugfs_root;
+
+struct irq_am_debugfs_attr {
+   const char *name;
+   umode_t mode;
+   int (*show)(void *, struct seq_file *);
+   ssize_t (*write)(void *, const char __user *, size_t, loff_t *);
+};
+
+static int irq_am_tune_state_show(void *data, struct seq_file *m)
+{
+   struct irq_am *am = data;
+
+   seq_printf(m, "%d\n", am->tune_state);
+   return 0;
+}
+
+static int irq_am_curr_level_show(void *data, struct seq_file *m)
+{
+   struct irq_am *am = data;
+
+   seq_printf(m, "%d\n", am->curr_level);
+   return 0;
+}
+
+static int irq_am_debugfs_show(struct seq_file *m, void *v)
+{
+   const struct irq_am_debugfs_attr *attr = m->private;
+   void *data = d_inode(m->file->f_path.dentry->d_parent)->i_private;
+
+   return attr->show(data, m);
+}
+
+static int irq_am_debugfs_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, irq_am_debugfs_show, inode->i_private);
+}
+
+static const struct file_operations irq_am_debugfs_fops = {
+   .open   = irq_am_debugfs_open,
+   .read   = seq_read,
+   .release= seq_release,
+};
+
+static const struct irq_am_debugfs_attr irq_am_attrs[] = {
+   {"tune_state", 0400, irq_am_tune_state_show},
+   {"curr_level", 0400, irq_am_curr_level_show},
+   {},
+};
+#endif
 
 static void irq_am_try_step(struct irq_am *am)
 {
@@ -160,10 +215,51 @@ static void irq_am_program_moderation_work(struct 
work_struct *w)
am->state = IRQ_AM_START_MEASURING;
 }
 
+#ifdef CONFIG_DEBUG_FS
+static bool debugfs_create_files(struct dentry *parent, void *data,
+const struct irq_am_debugfs_attr *attr)
+{
+   d_inode(parent)->i_private = data;
+
+   for (; attr->name; attr++) {
+   if (!debugfs_create_file(attr->name, attr->mode, parent,
+(void *)attr, &irq_am_debugfs_fops))
+   return false;
+   }
+   return true;
+}
+
+static int irq_am_register_debugfs(struct irq_am *am)
+{
+   char name[20];
+
+   snprintf(name, sizeof(name), "am%u", am->id);
+   am->debugfs_dir = debugfs_create_dir(name,
+   irq_am_debugfs_root);
+   if (!am->debugfs_dir)
+   return -ENOMEM;
+
+   if (!debugfs_create_files(am->debugfs_dir, am,
+   irq_am_attrs))
+   return -ENOMEM;
+   return 0;
+}
+
+static void irq_am_deregister_debugfs(struct irq_am *am)
+{
+   debugfs_remove_recursive(am->debugfs_dir);
+}
+
+#else
+static int irq_am_register_debugfs(struct irq_am *am) { return 0; }
+static void irq_am_deregister_debugfs(struct irq_am *am) {}
+#endif
 
 void irq_am_cleanup(struct irq_am *am)
 {
	flush_work(&am->work);
+   irq_am_deregister_debugfs(am);
+   ida_simple_remove(&am_ida, am->id);
 }
 EXPORT_SYMBOL_GPL(irq_am_cleanup);
 
@@ -177,6 +273,19 @@ void irq_am_init(struct irq_am *am, unsigned int nr_events,
am->nr_events = nr_events;
am->curr_level = start_level;
am->program = fn;
+   am->id = ida_simple_get(&am_ida, 0, 0, GFP_KERNEL);
+   WARN_ON(am->id < 0);
	INIT_WORK(&am->work, irq_am_program_moderation_work);
+   if (irq_am_register_debugfs(am))
+   pr_warn("irq-am %d failed to register debugfs\n", am->id);
 }
 EXPORT_SYMBOL_GPL(irq_am_init);
+
+static __init int irq_am_setup(void)
+{
+#ifdef CONFIG_DEBUG_FS
+   irq_am_debugfs_root = debugfs_create_dir("irq_am", NULL);
+#endif
+   return 0;
+}
+subsys_initcall(irq_am_setup);
-- 
2.14.1



[PATCH rfc 3/5] irq_poll: wire up irq_am

2018-02-05 Thread Sagi Grimberg
Update online stats for fired event and completions on
each poll cycle.

Also expose am initialization interface. The irqpoll consumer will
initialize the irq-am context of the irq-poll context.

Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 include/linux/irq_poll.h |  9 +
 lib/Kconfig  |  1 +
 lib/irq_poll.c   | 30 +-
 3 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/include/linux/irq_poll.h b/include/linux/irq_poll.h
index 16aaeccb65cb..de3055359fd8 100644
--- a/include/linux/irq_poll.h
+++ b/include/linux/irq_poll.h
@@ -2,14 +2,20 @@
 #ifndef IRQ_POLL_H
 #define IRQ_POLL_H
 
+#include 
+
 struct irq_poll;
 typedef int (irq_poll_fn)(struct irq_poll *, int);
+typedef int (irq_poll_am_fn)(struct irq_poll *, unsigned short);
 
 struct irq_poll {
struct list_head list;
unsigned long state;
int weight;
irq_poll_fn *poll;
+
+   struct irq_am am;
+   irq_poll_am_fn *amfn;
 };
 
 enum {
@@ -23,4 +29,7 @@ extern void irq_poll_complete(struct irq_poll *);
 extern void irq_poll_enable(struct irq_poll *);
 extern void irq_poll_disable(struct irq_poll *);
 
+extern void irq_poll_init_am(struct irq_poll *iop, unsigned int nr_events,
+unsigned short nr_levels, unsigned short start_level,
+   irq_poll_am_fn *amfn);
 #endif
diff --git a/lib/Kconfig b/lib/Kconfig
index bbb4c9eea84d..d495b21cd241 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -511,6 +511,7 @@ config IRQ_AM
 
 config IRQ_POLL
bool "IRQ polling library"
+   select IRQ_AM
help
  Helper library to poll interrupt mitigation using polling.
 
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index 86a709954f5a..6bc86a677d8c 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -53,6 +53,7 @@ static void __irq_poll_complete(struct irq_poll *iop)
	list_del(&iop->list);
	smp_mb__before_atomic();
	clear_bit_unlock(IRQ_POLL_F_SCHED, &iop->state);
+   irq_am_add_event(&iop->am);
 }
 
 /**
@@ -106,8 +107,10 @@ static void __latent_entropy irq_poll_softirq(struct 
softirq_action *h)
 
weight = iop->weight;
work = 0;
-   if (test_bit(IRQ_POLL_F_SCHED, &iop->state))
+   if (test_bit(IRQ_POLL_F_SCHED, &iop->state)) {
		work = iop->poll(iop, weight);
+   irq_am_add_comps(&iop->am, work);
+   }
 
budget -= work;
 
@@ -144,6 +147,7 @@ static void __latent_entropy irq_poll_softirq(struct 
softirq_action *h)
  **/
 void irq_poll_disable(struct irq_poll *iop)
 {
+   irq_am_cleanup(&iop->am);
	set_bit(IRQ_POLL_F_DISABLE, &iop->state);
	while (test_and_set_bit(IRQ_POLL_F_SCHED, &iop->state))
msleep(1);
@@ -185,6 +189,30 @@ void irq_poll_init(struct irq_poll *iop, int weight, 
irq_poll_fn *poll_fn)
 }
 EXPORT_SYMBOL(irq_poll_init);
 
+static int irq_poll_am(struct irq_am *am, unsigned short level)
+{
+   struct irq_poll *iop = container_of(am, struct irq_poll, am);
+
+   return iop->amfn(iop, level);
+}
+
+/**
+ * irq_poll_init_am - Initialize adaptive moderation parameters on this @iop
+ * @iop:  The parent iopoll structure
+ * @nr_events:  Number of events between adaptive moderation samples
+ * @nr_levels:  Number of moderation levels
+ * @start_level:  The initial moderation level
+ * @amfn:  The moderation program handler to invoke
+ *
+ * Description:
+ * Initialize adaptive moderation for this irq_poll structure.
+ **/
+void irq_poll_init_am(struct irq_poll *iop, unsigned int nr_events,
+unsigned short nr_levels, unsigned short start_level, irq_poll_am_fn 
*amfn)
+{
+   iop->amfn = amfn;
+   irq_am_init(&iop->am, nr_events, nr_levels, start_level, irq_poll_am);
+}
+EXPORT_SYMBOL(irq_poll_init_am);
+
 static int irq_poll_cpu_dead(unsigned int cpu)
 {
/*
-- 
2.14.1



[PATCH rfc 4/5] IB/cq: add adaptive moderation support

2018-02-05 Thread Sagi Grimberg
Currently activated via modparam, obviously we will want to
find a more generic way to control this.

Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 drivers/infiniband/core/cq.c | 48 
 1 file changed, 48 insertions(+)

diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index f2ae75fa3128..270801d28f9d 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -25,6 +25,51 @@
 #define IB_POLL_FLAGS \
(IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS)
 
+#define IB_CQ_AM_NR_LEVELS 8
+#define IB_CQ_AM_NR_EVENTS 64
+
+struct ib_cq_moder {
+   u16 usec;
+   u16 comps;
+};
+
+/* XXX: magic adaptive moderation profiles */
+static const struct ib_cq_moder am_prof[IB_CQ_AM_NR_LEVELS] = {
+   {1,  1},
+   {1,  2},
+   {2,  4},
+   {4,  8},
+   {8, 16},
+   {16, 32},
+   {32, 48},
+   {32, 64},
+};
+
+static bool use_am = true;
+module_param(use_am, bool, 0444);
+MODULE_PARM_DESC(use_am, "Use cq adaptive moderation");
+
+static int ib_cq_am(struct ib_cq *cq, unsigned short level)
+{
+   u16 usec;
+   u16 comps;
+
+   if (!use_am)
+   return 0;
+
+   usec = am_prof[level].usec;
+   comps = am_prof[level].comps;
+
+   return cq->device->modify_cq(cq, comps, usec);
+}
+
+static int ib_cq_irqpoll_am(struct irq_poll *iop, unsigned short level)
+{
+   struct ib_cq *cq = container_of(iop, struct ib_cq, iop);
+
+   return ib_cq_am(cq, level);
+}
+
 static int __ib_process_cq(struct ib_cq *cq, int budget)
 {
int i, n, completed = 0;
@@ -162,6 +207,9 @@ struct ib_cq *ib_alloc_cq(struct ib_device *dev, void 
*private,
cq->comp_handler = ib_cq_completion_softirq;
 
	irq_poll_init(&cq->iop, IB_POLL_BUDGET_IRQ, ib_poll_handler);
+   if (cq->device->modify_cq)
+   irq_poll_init_am(&cq->iop, IB_CQ_AM_NR_EVENTS,
+   IB_CQ_AM_NR_LEVELS, 0, ib_cq_irqpoll_am);
ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
break;
case IB_POLL_WORKQUEUE:
-- 
2.14.1



[PATCH rfc 0/5] generic adaptive IRQ moderation library for I/O devices

2018-02-05 Thread Sagi Grimberg
, IOPs quickly converge to ~12M IOPs (at the 
expense of slightly
higher latencies obviously) and debugfs stats show that the moderation level 
reached the
throughput optimum level. There is currently a known issue I've observed in 
some conditions
converging back to latency optimum (after reaching throughput optimum am 
levels) and I'll work
to fix the tuning algorithm. Thanks to Idan Burstein for running some 
benchmarks on his
performance setup.

I've also tested this locally with my single core VMs and saw similar 
improvement of ~50%
in throughput in a similar workload (355 KIOPs vs. 235 KIOPs). More testing 
will help
a lot to confirm and improve the implementation.

QD=1 Latency tests showed a marginal regression of up to 2% in latency (lightly 
tested though).
The reason at this point is that the moderator still bounces in the low latency 
am levels
constantly (would like to improve that).

Another observed issue is the presence of user context polling (IOCB_HIPRI),
which does not update the irq_am stats (mainly because it's not interrupt
driven). This can cause the moderator to do the wrong thing, as it's based on
a partial view of the load (optimize for latency instead of getting out of
the poller's way). However, recent discussions raised the possibility 
that polling requests
will be executed on a different set of queues with interrupts disabled 
altogether, which would make
this a non-issue.

None the less, I would like to get some initial feedback on the approach. Also, 
I'm not an expert
in tuning the algorithm. The basic approach was inspired by the mlx5 driver 
implementation which
seemed the closest to fit the abstraction level that I was aiming for. So I'd 
also love to get some
ideas on how to tune the algorithm better for various workloads (hence the RFC).

Lastly, I have also attempted to hook this into nvme (pcie), but that wasn't 
successful mainly
because the coalescing set_feature is global to the controller and not 
per-queue.
I'll be looking to bringing per-queue coalescing to the NVMe TWG (in case the 
community is
interested in supporting this).

Feedback would be highly appreciated, as well as a test drive with the code in 
case anyone
is interested :)

Sagi Grimberg (5):
  irq-am: Introduce helper library for adaptive moderation
implementation
  irq-am: add some debugfs exposure on tuning state
  irq_poll: wire up irq_am and allow to initialize it
  IB/cq: add adaptive moderation support
  IB/cq: wire up adaptive moderation to workqueue based completion
queues

 drivers/infiniband/core/cq.c |  73 ++-
 include/linux/irq-am.h   | 118 ++
 include/linux/irq_poll.h |   9 ++
 include/rdma/ib_verbs.h  |   9 +-
 lib/Kconfig  |   6 +
 lib/Makefile |   1 +
 lib/irq-am.c | 291 +++
 lib/irq_poll.c   |  30 -
 8 files changed, 529 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/irq-am.h
 create mode 100644 lib/irq-am.c

-- 
2.14.1



[PATCH rfc 5/5] IB/cq: wire up adaptive moderation to workqueue based completion queues

2018-02-05 Thread Sagi Grimberg
Signed-off-by: Sagi Grimberg <s...@grimberg.me>
---
 drivers/infiniband/core/cq.c | 25 -
 include/rdma/ib_verbs.h  |  9 +++--
 2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index 270801d28f9d..1d984fd77449 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* # of WCs to poll for with a single call to ib_poll_cq */
@@ -70,6 +71,13 @@ static int ib_cq_irqpoll_am(struct irq_poll *iop, unsigned 
short level)
return ib_cq_am(cq, level);
 }
 
+static int ib_cq_workqueue_am(struct irq_am *am, unsigned short level)
+{
+   struct ib_cq *cq = container_of(am, struct ib_cq, wq.am);
+
+   return ib_cq_am(cq, level);
+}
+
 static int __ib_process_cq(struct ib_cq *cq, int budget)
 {
int i, n, completed = 0;
@@ -147,18 +155,21 @@ static void ib_cq_completion_softirq(struct ib_cq *cq, 
void *private)
 
 static void ib_cq_poll_work(struct work_struct *work)
 {
-   struct ib_cq *cq = container_of(work, struct ib_cq, work);
+   struct ib_cq *cq = container_of(work, struct ib_cq, wq.work);
int completed;
 
completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);
+   irq_am_add_comps(&cq->wq.am, completed);
if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
-   queue_work(ib_comp_wq, &cq->work);
+   queue_work(ib_comp_wq, &cq->wq.work);
+   else
+   irq_am_add_event(&cq->wq.am);
 }
 
 static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
 {
-   queue_work(ib_comp_wq, &cq->work);
+   queue_work(ib_comp_wq, &cq->wq.work);
 }
 
 /**
@@ -214,7 +225,10 @@ struct ib_cq *ib_alloc_cq(struct ib_device *dev, void 
*private,
break;
case IB_POLL_WORKQUEUE:
cq->comp_handler = ib_cq_completion_workqueue;
-   INIT_WORK(&cq->work, ib_cq_poll_work);
+   INIT_WORK(&cq->wq.work, ib_cq_poll_work);
+   if (cq->device->modify_cq)
+   irq_am_init(&cq->wq.am, IB_CQ_AM_NR_EVENTS,
+   IB_CQ_AM_NR_LEVELS, 0, ib_cq_workqueue_am);
ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
break;
default:
@@ -250,7 +264,8 @@ void ib_free_cq(struct ib_cq *cq)
	irq_poll_disable(&cq->iop);
break;
case IB_POLL_WORKQUEUE:
-   cancel_work_sync(&cq->work);
+   cancel_work_sync(&cq->wq.work);
+   irq_am_cleanup(&cq->wq.am);
break;
default:
WARN_ON_ONCE(1);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index fd84cda5ed7c..14a93566adb6 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1555,6 +1555,11 @@ enum ib_poll_context {
IB_POLL_WORKQUEUE,  /* poll from workqueue */
 };
 
+struct ib_cq_workqueue_poll {
+   struct irq_am   am;
+   struct work_struct  work;
+};
+
 struct ib_cq {
struct ib_device   *device;
struct ib_uobject  *uobject;
@@ -1566,8 +1571,8 @@ struct ib_cq {
enum ib_poll_contextpoll_ctx;
struct ib_wc*wc;
union {
-   struct irq_poll iop;
-   struct work_struct  work;
+   struct irq_poll iop;
+   struct ib_cq_workqueue_poll wq;
};
 };
 
-- 
2.14.1



Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Sagi Grimberg



Hi Bart,

My another 2 cents:)
On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche 
wrote:


On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:


o Simple configuration of IBNBD:
 - Server side is completely passive: volumes do not need to be
   explicitly exported.



That sounds like a security hole? I think the ability to configure
whether or
not an initiator is allowed to log in is essential and also which volumes
an
initiator has access to.


Our design target for well controlled production environment, so
security is handle in other layer.



What will happen to a new adopter of the code you are contributing?


Hi Sagi, Hi Bart,
thanks for your feedback.
We considered the "storage cluster" setup, where each ibnbd client has
access to each ibnbd server. Each ibnbd server manages devices under
his "dev_search_path" and can provide access to them to any ibnbd
client in the network.


I don't understand how that helps?


On top of that Ibnbd server has an additional
"artificial" restriction, that a device can be mapped in writable-mode
by only one client at once.


I think one would still need the option to disallow readable export as
well.


  1   2   3   4   5   >