RE: [PATCH v2] block: fix rdma queue mapping

2018-10-23 Thread Steve Wise



> -----Original Message-----
> From: Sagi Grimberg 
> Sent: Tuesday, October 23, 2018 4:25 PM
> To: Steve Wise ; 'Christoph Hellwig'
> 
> Cc: linux-block@vger.kernel.org; linux-r...@vger.kernel.org; linux-
> n...@lists.infradead.org; 'Max Gurtovoy' 
> Subject: Re: [PATCH v2] block: fix rdma queue mapping
> 
> 
> >>>> Christoph, Sagi:  it seems you think /proc/irq/$IRP/smp_affinity
> >>>> shouldn't be allowed if drivers support managed affinity. Is that
> correct?
> >>>
> >>> Not just shouldn't, but simply can't.
> >>>
> >>>> But as it stands, things are just plain borked if an rdma driver
> >>>> supports ib_get_vector_affinity() yet the admin changes the affinity via
> >>>> /proc...
> >>>
> >>> I think we need to fix ib_get_vector_affinity to not return anything
> >>> if the device doesn't use managed irq affinity.
> >>
> >> Steve, does iw_cxgb4 use managed affinity?
> >>
> >> I'll send a patch for mlx5 to simply not return anything as managed
> >> affinity is not something that the maintainers want to do.
> >
> > I'm beginning to think I don't know what "managed affinity" actually is.
> > Currently iw_cxgb4 doesn't support ib_get_vector_affinity().  I have a patch
> > for it, but ran into this whole issue with nvme failing if someone changes the
> > affinity map via /proc.
> 
> That means that the pci subsystem gets your vector(s) affinity right and
> immutable. It also guarantees that you have reserved vectors and not get
> a best effort assignment when cpu cores are offlined.
> 
> You can simply enable it by adding PCI_IRQ_AFFINITY to
> pci_alloc_irq_vectors() or call pci_alloc_irq_vectors_affinity()
> to communicate post/pre vectors that don't participate in
> affinitization (nvme uses it for admin queue).
> 
> This way you can easily plug ->get_vector_affinity() to return
> pci_irq_get_affinity(dev, vector)
> 
> The original patch set from hch:
> https://lwn.net/Articles/693653/
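
For anyone wanting to see what that looks like in practice, here is a minimal
sketch, assuming an MSI-X PCI device and hypothetical function names (this is
not iw_cxgb4 code): the PCI core spreads and pins the IO vectors, one
pre-vector is kept out of the affinitization, and the RDMA
->get_vector_affinity() hook simply reports the mask the core assigned.

#include <linux/pci.h>
#include <linux/interrupt.h>

/* Hypothetical example: enable managed affinity for the IO vectors. */
static int example_setup_irq_vectors(struct pci_dev *pdev, int nr_io_vectors)
{
	struct irq_affinity affd = {
		.pre_vectors = 1,	/* e.g. one non-affinitized async/admin vector */
	};

	/* The PCI core spreads the remaining vectors over the cpus and pins them. */
	return pci_alloc_irq_vectors_affinity(pdev, 1, nr_io_vectors + 1,
					      PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					      &affd);
}

/*
 * The driver's ->get_vector_affinity() implementation can then just return
 * the mask the PCI core chose for the completion vector (offset past the
 * non-affinitized pre-vector).
 */
static const struct cpumask *example_comp_vector_affinity(struct pci_dev *pdev,
							   int comp_vector)
{
	return pci_irq_get_affinity(pdev, comp_vector + 1);
}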

Thanks for educating me.   I'll have a look into this. 

Steve.



RE: [PATCH v2] block: fix rdma queue mapping

2018-10-23 Thread Steve Wise
> >> Christoph, Sagi:  it seems you think /proc/irq/$IRP/smp_affinity
> >> shouldn't be allowed if drivers support managed affinity. Is that correct?
> >
> > Not just shouldn't, but simply can't.
> >
> >> But as it stands, things are just plain borked if an rdma driver
> >> supports ib_get_vector_affinity() yet the admin changes the affinity via
> >> /proc...
> >
> > I think we need to fix ib_get_vector_affinity to not return anything
> > if the device doesn't use managed irq affinity.
> 
> Steve, does iw_cxgb4 use managed affinity?
> 
> I'll send a patch for mlx5 to simply not return anything as managed
> affinity is not something that the maintainers want to do.

I'm beginning to think I don't know what "managed affinity" actually is.  
Currently iw_cxgb4 doesn't support ib_get_vector_affinity().  I have a patch 
for it, but ran into this whole issue with nvme failing if someone changes the 
affinity map via /proc.

Steve.



Re: [PATCH v2] block: fix rdma queue mapping

2018-10-03 Thread Steve Wise



On 8/24/2018 9:17 PM, Sagi Grimberg wrote:
> 
>>> nvme-rdma attempts to map queues based on irq vector affinity.
>>> However, for some devices, completion vector irq affinity is
>>> configurable by the user which can break the existing assumption
>>> that irq vectors are optimally arranged over the host cpu cores.
>>
>> IFF affinity is configurable we should never use this code,
>> as it breaks the model entirely.  ib_get_vector_affinity should
>> never return a valid mask if affinity is configurable.
> 
> I agree that the model intended initially doesn't fit. But it seems
> that some users like to write into their nic's
> /proc/irq/$IRP/smp_affinity and get mad at us for not letting them do so
> when we use managed affinity.
> 
> So instead of falling back to the block mapping function we try
> to do a little better first:
> 1. map according to the device vector affinity
> 2. map vectors that end up without a mapping to cpus that belong
>    to the same numa-node
> 3. map all the rest of the unmapped cpus like the block layer
>    would do.
> 
> We could have device drivers that don't use managed affinity to never
> return a valid mask but that would never allow affinity based mapping
> which is optimal at least for users that do not mangle with device
> irq affinity (which is probably the majority of users).
> 
> Thoughts?
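
For readers following along, a rough sketch of the affinity-based mapping being
discussed, modeled on the existing blk_mq_rdma_map_queues() helper and
simplified (the NUMA-local intermediate step from Sagi's patch is left out, and
the function name here is illustrative):

#include <linux/blk-mq.h>
#include <rdma/ib_verbs.h>

static int example_rdma_map_queues(struct blk_mq_tag_set *set,
				   struct ib_device *dev, int first_vec)
{
	const struct cpumask *mask;
	unsigned int queue, cpu;

	for (queue = 0; queue < set->nr_hw_queues; queue++) {
		mask = ib_get_vector_affinity(dev, first_vec + queue);
		if (!mask)
			goto fallback;	/* driver exposes no per-vector affinity */

		/* Step 1: map cpus according to the device vector affinity. */
		for_each_cpu(cpu, mask)
			set->mq_map[cpu] = queue;
	}
	return 0;

fallback:
	/* Step 3: let the block layer spread queues over all cpus. */
	return blk_mq_map_queues(set);
}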

Can we please make forward progress on this?

Christoph, Sagi:  it seems you think /proc/irq/$IRP/smp_affinity
shouldn't be allowed if drivers support managed affinity. Is that correct?

Perhaps that could be codified as a way forward?  I.e., somehow allow
the admin to choose either "managed by the driver/ulps" or "managed by
the system admin directly"?

Or just use Sagi's patch.  Perhaps a WARN_ONCE() if the affinity looks
wonked when set via procfs?  Just thinking out loud...

But as it stands, things are just plain borked if an rdma driver
supports ib_get_vector_affinity() yet the admin changes the affinity via
/proc...

Thanks,

Steve.


RE: [PATCH v2] block: fix rdma queue mapping

2018-08-25 Thread Steve Wise
> > I guess this way still can't fix the request allocation crash issue
> > triggered by using blk_mq_alloc_request_hctx(), in which one hw queue may
> > not be mapped from any online CPU.
> 
> Not really. I guess we will need to simply skip queues that are
> mapped to an offline cpu.
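
For context, the "skip queues mapped to an offline cpu" idea amounts to
something like the sketch below; it is a simplified illustration with a
hypothetical helper name, not the eventual upstream fix.

#include <linux/blk-mq.h>
#include <linux/cpumask.h>

/* Pick an online cpu mapped to this hw queue, or fail the allocation. */
static int example_hctx_pick_online_cpu(struct blk_mq_hw_ctx *hctx)
{
	int cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);

	if (cpu >= nr_cpu_ids)
		return -EINVAL;	/* every cpu mapped to this hctx is offline */
	return cpu;
}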
> 
> > Maybe this patch isn't for this issue, but it is closely related.
> 
> Yes, another patch is still needed.
> 
> Steve, do you still have that patch? I don't seem to
> find it anywhere.

I have no such patch.  I don't remember this issue.

Steve.



Re: [PATCH rfc 0/6] Automatic affinity settings for nvme over rdma

2017-04-10 Thread Steve Wise

On 4/2/2017 8:41 AM, Sagi Grimberg wrote:

> This patch set is aiming to automatically find the optimal
> queue <-> irq multi-queue assignments in storage ULPs (demonstrated
> on nvme-rdma) based on the underlying rdma device irq affinity
> settings.
>
> First two patches modify mlx5 core driver to use generic API
> to allocate array of irq vectors with automatic affinity
> settings instead of open-coding exactly what it does (and
> slightly worse).
>
> Then, in order to obtain an affinity map for a given completion
> vector, we expose a new RDMA core API, and implement it in mlx5.
>
> The third part is addition of a rdma-based queue mapping helper
> to blk-mq that maps the tagset hctx's according to the device
> affinity mappings.
>
> I'd happily convert some more drivers, but I'll need volunteers
> to test as I don't have access to any other devices.


I'll test cxgb4 if you convert it. :)



RE: [RFC PATCH 00/28] INFINIBAND NETWORK BLOCK DEVICE (IBNBD)

2017-03-24 Thread Steve Wise
> 
> From: Jack Wang 
> 
> This series introduces IBNBD/IBTRS kernel modules.
> 
> IBNBD (InfiniBand network block device) allows for an RDMA transfer of block IO
> over InfiniBand network. The driver presents itself as a block device on client
> side and transmits the block requests in a zero-copy fashion to the server-side
> via InfiniBand. The server part of the driver converts the incoming buffers back
> into BIOs and hands them down to the underlying block device. As soon as IO
> responses come back from the drive, they are being transmitted back to the
> client.

Hey Jack, why is this IB specific?  Can it work over iWARP transports as well?

Steve.





RE: A question regarding "multiple SGL"

2016-10-27 Thread Steve Wise
> > Hi Robert,
> 
> Hey Robert, Christoph,
> 
> > please explain your use cases that isn't handled.  The one and only
> > reason to set MSDBD to 1 is to make the code a lot simpler given that
> > there is no real use case for supporting more.
> >
> > RDMA uses memory registrations to register large and possibly
> > discontiguous data regions for a single rkey, aka single SGL descriptor
> > in NVMe terms.  There would be two reasons to support multiple SGL
> > descriptors:  a) to support a larger I/O size than supported by a single
> > MR, or b) to support a data region format not mappable by a single
> > MR.
> >
> > iSER only supports a single rkey (or stag in IETF terminology) and has
> > been doing fine on a) and mostly fine on b).   There are a few possible
> > data layouts not supported by the traditional IB/iWarp FR WRs, but the
> > limit is in fact exactly the same as imposed by the NVMe PRPs used for
> > PCIe NVMe devices, so the Linux block layer has support to not generate
> > them.  Also with modern Mellanox IB/RoCE hardware we can actually
> > register completely arbitrary SGLs.  iSER supports using this registration
> > mode already with a trivial code addition, but for NVMe we didn't have a
> > pressing need yet.
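
For readers less familiar with the verbs side, collapsing a discontiguous SG
list behind a single rkey looks roughly like the sketch below; the function
name is hypothetical and this is not nvme-rdma code.

#include <rdma/ib_verbs.h>

/* Register one (possibly discontiguous) SG list under a single rkey. */
static int example_register_sgl(struct ib_qp *qp, struct ib_mr *mr,
				struct scatterlist *sg, int sg_nents)
{
	struct ib_reg_wr reg_wr = {};
	const struct ib_send_wr *bad_wr;
	int n;

	/* Collapse the pages of the SG list behind the MR's rkey. */
	n = ib_map_mr_sg(mr, sg, sg_nents, NULL, PAGE_SIZE);
	if (n < sg_nents)
		return -EINVAL;	/* MR too small: this is where multiple SGLs would be needed */

	reg_wr.wr.opcode = IB_WR_REG_MR;
	reg_wr.mr = mr;
	reg_wr.key = mr->rkey;
	reg_wr.access = IB_ACCESS_LOCAL_WRITE |
			IB_ACCESS_REMOTE_READ |
			IB_ACCESS_REMOTE_WRITE;

	/* Once this WR completes, (rkey, mr->iova, mr->length) describes the
	 * whole buffer as a single SGL descriptor on the wire. */
	return ib_post_send(qp, &reg_wr.wr, &bad_wr);
}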
> 
> Good explanation :)
> 
> The IO transfer size is a bit more pressing on some devices (e.g.
> cxgb3/4) where the number of pages per-MR can be indeed lower than
> a reasonable transfer size (Steve can correct me if I'm wrong).
>

Currently, cxgb4 supports 128KB REG_MR operations on a host with 4K page size,
via a max mr page list depth of 32.  Soon it will be bumped up from 32 to 128
and life will be better...

 
> However, if there is a real demand for this we'll happily accept
> patches :)
> 
> Just a note, having this feature in-place can bring unexpected behavior
> depending on how we implement it:
> - If we can use multiple MRs per IO (for multiple SGLs) we can either
> prepare for the worst-case and allocate enough MRs to satisfy the
> various IO patterns. This will be much heavier in terms of resource
> allocation and can limit the scalability of the host driver.
> - Or we can implement a shared MR pool with a reasonable number of MRs.
> In this case each IO can consume one or more MRs on the expense of
> other IOs. In this case we may need to requeue the IO later when we
> have enough available MRs to satisfy the IO. This can yield some
> unexpected performance gaps for some workloads.
> 

I would like to see the storage protocols deal with a lack of resources rather
than provisioning for the worst case.  This allows much smaller resource usage for
both MRs and SQ resources, at the expense of adding flow-control logic to deal
with a lack of available MRs and/or SQ slots for the next IO.  I think it can be
implemented efficiently such that, when in flow-control mode, the code drives new
IO submissions off of SQ completions, which free up SQ slots and most likely MRs
from the QP's MR pool.
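
Something along these lines, purely illustrative and with hypothetical types and
names (not code from any driver): submissions that cannot get an MR or enough SQ
slots are parked on a pending list, and the send-completion handler returns the
resources and restarts them, so new IO is driven off SQ completions while in
flow-control mode.

#include <linux/list.h>
#include <linux/spinlock.h>

struct example_mr {			/* wraps whatever MR state the driver keeps */
	struct list_head	entry;
};

struct io_ctx {
	struct list_head	entry;
	struct example_mr	*mr;
	unsigned int		sq_slots;	/* send queue WRs this IO needs */
};

struct io_resources {
	spinlock_t		lock;
	struct list_head	free_mrs;	/* idle example_mr entries */
	unsigned int		free_sq_slots;
	struct list_head	pending;	/* IOs waiting for MR/SQ space */
};

void example_post_io(struct io_ctx *io);	/* hypothetical: register MR + post WRs */

static bool example_try_submit(struct io_resources *res, struct io_ctx *io)
{
	bool ok;

	spin_lock(&res->lock);
	ok = !list_empty(&res->free_mrs) && res->free_sq_slots >= io->sq_slots;
	if (ok) {
		io->mr = list_first_entry(&res->free_mrs, struct example_mr, entry);
		list_del(&io->mr->entry);
		res->free_sq_slots -= io->sq_slots;
	} else {
		list_add_tail(&io->entry, &res->pending);	/* flow-control mode */
	}
	spin_unlock(&res->lock);

	if (ok)
		example_post_io(io);
	return ok;
}

/* Send CQ handler: completions return SQ slots and the MR, then kick waiters. */
static void example_send_done(struct io_resources *res, struct io_ctx *done)
{
	struct io_ctx *next;

	spin_lock(&res->lock);
	res->free_sq_slots += done->sq_slots;
	list_add(&done->mr->entry, &res->free_mrs);
	next = list_first_entry_or_null(&res->pending, struct io_ctx, entry);
	if (next)
		list_del(&next->entry);
	spin_unlock(&res->lock);

	if (next)
		example_try_submit(res, next);
}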

Steve.




RE: [PATCH 9/9] [RFC] nvme: Fix a race condition

2016-09-28 Thread Steve Wise
> 
> Hello James and Steve,
> 
> I will add a comment.
> 
> Please note that the above patch does not change the behavior of
> nvme_stop_queues() except that it causes nvme_stop_queues() to wait
> until any ongoing nvme_queue_rq() calls have finished.
> blk_resume_queue() does not affect the value of the BLK_MQ_S_STOPPED bit
> that has been set by blk_mq_stop_hw_queues(). All it does is to resume
> pending blk_queue_enter() calls and to ensure that future
> blk_queue_enter() calls do not block. Even after blk_resume_queue() has
> been called if a new request is queued queue_rq() won't be invoked
> because the BLK_MQ_S_STOPPED bit is still set. Patch "dm: Fix a race
> condition related to stopping and starting queues" realizes a similar
> change in the dm driver and that change has been tested extensively.
> 
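
In blk-mq terms, the reason queue_rq() stays off is roughly the check sketched
below; this is a simplified illustration of the dispatch-path behavior Bart
describes, not a verbatim copy of the blk-mq code.

#include <linux/blk-mq.h>

/* Dispatch bails while BLK_MQ_S_STOPPED is set, so ->queue_rq() is not
 * invoked even after blk_resume_queue() lets blk_queue_enter() callers
 * through. */
static bool example_hctx_stopped(struct blk_mq_hw_ctx *hctx)
{
	return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
}

static void example_run_hw_queue(struct blk_mq_hw_ctx *hctx)
{
	if (example_hctx_stopped(hctx))
		return;			/* no ->queue_rq() calls while stopped */

	/* ... dispatch pending requests, invoking ->queue_rq() ... */
}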

Thanks for the detailed explanation!  I think your code, then, is correct 
as-is.   And this series doesn't fix the issue I'm hitting, so I'll keep 
digging. :)

Steve.  



RE: [PATCH 9/9] [RFC] nvme: Fix a race condition

2016-09-27 Thread Steve Wise
> On 09/27/2016 09:31 AM, Steve Wise wrote:
> >> @@ -2079,11 +2075,15 @@ EXPORT_SYMBOL_GPL(nvme_kill_queues);
> >>  void nvme_stop_queues(struct nvme_ctrl *ctrl)
> >>  {
> >>  	struct nvme_ns *ns;
> >> +	struct request_queue *q;
> >>
> >>  	mutex_lock(&ctrl->namespaces_mutex);
> >>  	list_for_each_entry(ns, &ctrl->namespaces, list) {
> >> -		blk_mq_cancel_requeue_work(ns->queue);
> >> -		blk_mq_stop_hw_queues(ns->queue);
> >> +		q = ns->queue;
> >> +		blk_quiesce_queue(q);
> >> +		blk_mq_cancel_requeue_work(q);
> >> +		blk_mq_stop_hw_queues(q);
> >> +		blk_resume_queue(q);
> >>  	}
> >>  	mutex_unlock(&ctrl->namespaces_mutex);
> >
> > Hey Bart, should nvme_stop_queues() really be resuming the blk queue?
> 
> Hello Steve,
> 
> Would you perhaps prefer that blk_resume_queue(q) is called from
> nvme_start_queues()? I think that would make the NVMe code harder to
> review. 

I'm still learning the blk code (and nvme code :)), but I would think
blk_resume_queue() would cause requests to start being submitted on the NVMe
queues, which I believe shouldn't happen when they are stopped.  I'm currently
debugging a problem where requests are submitted to the nvme-rdma driver while
it has supposedly stopped all the nvme and blk mqs.  I tried your series at
Christoph's request to see if it resolved my problem, but it didn't.

> The above code won't cause any unexpected side effects if an
> NVMe namespace is removed after nvme_stop_queues() has been called and
> before nvme_start_queues() is called. Moving the blk_resume_queue(q)
> call into nvme_start_queues() will only work as expected if no
> namespaces are added nor removed between the nvme_stop_queues() and
> nvme_start_queues() calls. I'm not familiar enough with the NVMe code to
> know whether or not this change is safe ...
> 

I'll have to look and see if new namespaces can be added/deleted while an nvme
controller is in the RECONNECTING state.  In the meantime, I'm going to move
the blk_resume_queue() to nvme_start_queues() and see if it helps my problem.
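
For concreteness, that experiment would look roughly like the sketch below,
built on the helpers from Bart's series and the nvme core of that era; this is
an untested illustration, not a proposed patch.

void nvme_start_queues(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	mutex_lock(&ctrl->namespaces_mutex);
	list_for_each_entry(ns, &ctrl->namespaces, list) {
		blk_resume_queue(ns->queue);	/* moved here from nvme_stop_queues() */
		blk_mq_start_stopped_hw_queues(ns->queue, true);
		blk_mq_kick_requeue_list(ns->queue);
	}
	mutex_unlock(&ctrl->namespaces_mutex);
}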

Christoph:  Thoughts?

Steve.



RE: [PATCH 9/9] [RFC] nvme: Fix a race condition

2016-09-27 Thread Steve Wise
> @@ -2079,11 +2075,15 @@ EXPORT_SYMBOL_GPL(nvme_kill_queues);
>  void nvme_stop_queues(struct nvme_ctrl *ctrl)
>  {
>  	struct nvme_ns *ns;
> +	struct request_queue *q;
>
>  	mutex_lock(&ctrl->namespaces_mutex);
>  	list_for_each_entry(ns, &ctrl->namespaces, list) {
> -		blk_mq_cancel_requeue_work(ns->queue);
> -		blk_mq_stop_hw_queues(ns->queue);
> +		q = ns->queue;
> +		blk_quiesce_queue(q);
> +		blk_mq_cancel_requeue_work(q);
> +		blk_mq_stop_hw_queues(q);
> +		blk_resume_queue(q);
>  	}
>  	mutex_unlock(&ctrl->namespaces_mutex);

Hey Bart, should nvme_stop_queues() really be resuming the blk queue?


