Re: [PATCH 10/11] IB: only keep a single key in struct ib_mr

2016-01-06 Thread Jason Gunthorpe
On Wed, Jan 06, 2016 at 05:10:06PM +0200, Sagi Grimberg wrote:
> 
> >>>ULPs are *already* using the same registrations for both local and
> >>>remote access.
> >>
> >>Where? Out of tree?
> >
> >I haven't found anything in-tree for sure.
> 
> We have that in iSER.
> 
> iSCSI allows a FirstBurst functionality and iSER as an iSCSI
> transport is required to support that.
> 
> The FirstBurst is divided into ImmediateData which comes with
> the SCSI cdb and followed by UnsolicitedDataOut commands. This
> depends in the target support negotiation in the iSCSI login.

Why not use local_rdma_lkey for for the CDB part?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/11] IB: only keep a single key in struct ib_mr

2016-01-05 Thread Jason Gunthorpe
On Fri, Dec 25, 2015 at 06:46:07PM +0200, Liran Liss wrote:
> > From: Jason Gunthorpe <jguntho...@obsidianresearch.com>
> 
> > > >fill mr->key by the lkey or rkey based on that and everything will
> > > >work fine.
> > >
> > > But the ULP *can* register a memory buffer with local and remote
> > > access permissions.
> > Not in the new API.
> >
> > If a ULP ever comes along that does need that then they can start with
> > two MRs and then eventually upgrade the kapi to have some kind of
> > efficient bidir MR mode.
> 
> ULPs are *already* using the same registrations for both local and
> remote access.

Where? Out of tree?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next 2/2] IB/core: Use hop-limit from IP stack for RoCE

2016-01-03 Thread Jason Gunthorpe
On Sun, Jan 03, 2016 at 03:59:11PM +0200, Matan Barak wrote:
> @@ -434,6 +434,7 @@ int ib_init_ah_from_wc(struct ib_device *device, u8 
> port_num,
>   int ret;
>   enum rdma_network_type net_type = RDMA_NETWORK_IB;
>   enum ib_gid_type gid_type = IB_GID_TYPE_IB;
> + int hoplimit = grh->hop_limit;

>   ah_attr->grh.flow_label = flow_class & 0xF;
> - ah_attr->grh.hop_limit = 0xFF;
> + ah_attr->grh.hop_limit = hoplimit;

No, this is wrong for IB. Please be careful to follow the IB
specification language for computing a hop limit on a reversible path.

No idea about rocee, but I can't believe using grh->hop_limit is right
there either.

Jason

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/11] IB: only keep a single key in struct ib_mr

2015-12-22 Thread Jason Gunthorpe
On Tue, Dec 22, 2015 at 06:58:14PM +0200, Sagi Grimberg wrote:
> 
> >The ULP decides if this MR is going to be used as a lkey or rkey
> >by passing IB_REG_LKEY or IB_REG_RKEY.  The HCA driver will then
> >fill mr->key by the lkey or rkey based on that and everything will
> >work fine.
> 
> But the ULP *can* register a memory buffer with local and remote
> access permissions.

Not in the new API.

If a ULP ever comes along that does need that then they can start with
two MRs and then eventually upgrade the kapi to have some kind of
efficient bidir MR mode.

What we've seen on the list lately is that every single ULP seems to
have technical problems running the stack properly. We need to get off
this idea that the spec has to govern the kapi - that didn't lead us
any place nice.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH libibverbs 0/3] Add cross-channel support

2015-12-20 Thread Jason Gunthorpe
On Sun, Dec 20, 2015 at 01:22:41PM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> This patchset adds supplementary part of cross-channel support [1]
> to libibverbs.

All of this needs some kind of man page update

And considering this is a non-standard feature it had better be a
really good man page with examples and so forth.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH rdma-next 1/6] IB/core: Save the device attributes on the device structure

2015-12-18 Thread Jason Gunthorpe
On Fri, Dec 18, 2015 at 10:08:41AM +0200, Or Gerlitz wrote:
> On 12/17/2015 7:41 PM, Jason Gunthorpe wrote:
> >On Thu, Dec 17, 2015 at 03:44:19PM +0200, Sagi Grimberg wrote:
> >>>+  ret = ib_query_device(device, >attrs);
> >>>+  if (ret) {
> >>>+  printk(KERN_WARNING "Couldn't query the device attributes\n");
> >>>+  goto out;
> >>>+  }
> >>>+
> >>I thought we're all for removing the call altogether aren't we?
> >>
> >>I'd say just call device->query_device() instead.
> >Christoph's patch even got rid of device->query_device(),
> 
> Wrong.

Not really, lots of hunks in Christoph's patch are removing
query_device ie:

@@ -1305,7 +1299,6 @@  int mthca_register_device(struct mthca_dev *dev)
dev->ib_dev.phys_port_cnt= dev->limits.num_ports;
dev->ib_dev.num_comp_vectors = 1;
dev->ib_dev.dma_device   = >pdev->dev;
-   dev->ib_dev.query_device = mthca_query_device;
dev->ib_dev.query_port   = mthca_query_port;

Sure, it sticks around in a couple places but it isn't 'query_device'
anymore, it is 'query_device_udata' which is reasonable.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH rdma-next 1/6] IB/core: Save the device attributes on the device structure

2015-12-17 Thread Jason Gunthorpe
On Thu, Dec 17, 2015 at 03:44:19PM +0200, Sagi Grimberg wrote:
> 
> >+ret = ib_query_device(device, >attrs);
> >+if (ret) {
> >+printk(KERN_WARNING "Couldn't query the device attributes\n");
> >+goto out;
> >+}
> >+
> 
> I thought we're all for removing the call altogether aren't we?
> 
> I'd say just call device->query_device() instead.

Christoph's patch even got rid of device->query_device(), which, IHMO,
I prefer to see over this. It re-enforces that these values are
constants and drivers cannot change them on the fly.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] svc_rdma: use local_dma_lkey

2015-12-16 Thread Jason Gunthorpe
On Wed, Dec 16, 2015 at 04:11:04PM +0100, Christoph Hellwig wrote:
> We now alwasy have a per-PD local_dma_lkey available.  Make use of that
> fact in svc_rdma and stop registering our own MR.
> 
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Reviewed-by: Jason Gunthorpe <jguntho...@obsidianresearch.com>

> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -144,6 +144,7 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
>  
>   head->arg.pages[pg_no] = rqstp->rq_arg.pages[pg_no];
>   head->arg.page_len += len;
> +
>   head->arg.len += len;
>   if (!pg_off)
>   head->count++;

Was this hunk deliberate?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-16 Thread Jason Gunthorpe
On Wed, Dec 16, 2015 at 08:56:01AM +0200, Moni Shoua wrote:

> I can't object to that but I really would like to get an example of a
> security risk.

How can anyone give you an example when nobody knows exactly how mlx
hardware works in this area?

>From an kapi prespective, the security design is very simple.

Every single UD packet the kapi side has to process must be
unambiguously associated with a gid_index or dropped. Period full
stop. I would think that is an obvious conclusion based on the design
of the gid cache.

This is why we need a clear API to get this critical information. It
should not be open coded in init_ah_from_wc, it should not be done
some other way in the CMA code.

This is a simple matter of sane kapi design, nothing more.

I have no idea why this is so objectionable.

> scattered to the receive bufs anyway. So, if there is a security hole
> it exists from day one of the IB stack and this is not the time we
> should insist on fixing it.

IB isn't interacting with the net stack in the same way rocev2 is, so
this is not a pre-existing problem.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path

2015-12-16 Thread Jason Gunthorpe
On Wed, Dec 16, 2015 at 11:26:39AM +0100, Michael Wang wrote:
> 
> On 12/15/2015 06:30 PM, Jason Gunthorpe wrote:
> > On Tue, Dec 15, 2015 at 05:38:34PM +0100, Michael Wang wrote:
> >> The hop_limit is only suggest that the package allowed to be
> >> routed, not have to, correct?
> > 
> > If the hop limit is >= 2 (?) then the GRH is mandatory. The
> > SM will return this information in the PathRecord if it determines a
> > GRH is required. The whole stack follows this protocol.
> > 
> > The GRH is optional for in-subnet communications.
> 
> Thanks for the explain :-)
> 
> I've rechecked the ib_init_ah_from_path() again, and found it
> still set IB_AH_GRH when the GID cache missing, but with:

How do you mean?

ah_attr->ah_flags = IB_AH_GRH;
ah_attr->grh.dgid = rec->dgid;

ret = ib_find_cached_gid(device, >sgid, ndev, _num,
 _index);
if (ret) {
if (ndev)
dev_put(ndev);
return ret;
}

If find_cached_gid fails then ib_init_ah_from_path also fails.

Is there a case where ib_find_cached_gid can succeed but not return
good data?

I agree it would read nicer if the ah_flags and gr.dgid was moved
after the ib_find_cached_gid

> BTW, cma_sidr_rep_handler() also call ib_init_ah_from_path() with out
> a check on return.

That sounds like a problem.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-16 Thread Jason Gunthorpe
On Wed, Dec 16, 2015 at 09:57:02AM +, Liran Liss wrote:
> Currently, namespaces are not supported for RoCE.

IMHO, we should not be accepting rocev2 without at least basic
namespace support too, since it is fairly trivial to do based on the
work that is already done for verbs. An obvious missing piece is the
'wc to gid index' API I keep asking for.

> That said, we have everything we need for RoCE namespace support when we get 
> there.

Then there is no problem with the 'wc to gid index' stuff, so stop
complaining about it.

> All of this has nothing to do with "broken" and enshrining anything in the 
> kapi.
> That's just bullshit.

No, it is a critique of the bad kAPI choices in this patch that mean
it broadly doesn't use namespaces, net devices or IP routing
correctly.

> The design of the RDMA stack is that Verbs are used by core IB
> services, such as addressing.  For these services, as the
> specification requires, all relevant fields must be reported in the
> CQE, period.  All spec-compliant HW devices follow this.

Wrong, the kapi needs to meet the needs of the kernel, and is
influenced but not set by the various standards.

That means we get to make better choices in the kapi than exposing
wc.network_type.

> If a ULP wants to create an address handle from a completion, there
> are service routines to accomplish that, based on the reported
> fields.  If it doesn't care, there is no reason to sacrifice
> performance.

I have no idea why you think there would be a performance sacrifice,
maybe you should review the patches and my remarks again.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-16 Thread Jason Gunthorpe
On Wed, Dec 16, 2015 at 03:39:16PM -0500, Doug Ledford wrote:

> These patches add the concept of duplicate GIDs that are differentiated
> by their RoCE version (also called network type).

and by vlan, and smac, and ... Basically everything network unique about
a namespace has to be encapsulted in the gid index now. Each namespace
thus has a subset of gid indexes that are valid for it to use for
outbound and to recieve packets on.

roce didn't really have a way to work with net namespaces, AFAIK (?)
so it gets a pass.

But rocev2 very clearly does. It needs needs to address the issue
outlined in commit b8cab5dab15ff5c2acc3faefdde28919b0341c11 (IB/cma:
Accept connection without a valid netdev on RoCE)

That means cma.c needs to get the gid index every single CMA packet it
processes and confirm that the associated net device is permitted to
talk to the matching CM ID. 

It is no mistake there is a hole in cma.c waiting for this code, when
Haggai did that work it was very clear in my mind that rocev2 would
need to slot into here as well.

> Jason's objections are this:
> 
> 1)  The lazy resolution is wrong.

Wrong in the sense it doesn't actually exist in a usable form
anyplace.

cma.c does not do it, and absoultely must as discussed above.

init_ah_from_wc needs to do it, and maybe does. It is hard to tell,
perhaps a 'rdma_wc_to_dgid_index()' is actually open coded in there
now. Just from a code readability perspective that is ugly.

Then we get into the missing route handling in all places that
construct a rocev2 AH...

> Jason's preference would be that the above issues be resolved by
> skipping the lazy resolution and instead doing proactive resolution
> on

I am happy with lazy resolution, that is a fine compromise.

I just want to see kapi that makes sense here. It is very clear to me
no kernel user can possibly correctly touch a rocev2 UD packet without
retrieving the gid index, so we must have a kAPI for this.

> namespace.  Or, at a minimum, at least make the information added to the
> core API not something vendor specific like network_type, which is a
> detail of the Mellanox implementation.

I keep suggesting a rdma_wc_to_dgid_index() API call.

Perhasp most of he code for this already seems to exist in
init_ah_from_wc.

> 1 - Actually, for any received packet with associated IP address
> information.  We've only enabled net namespaces for IP connections
> between user space applications, for direct IB connections or for kernel
> connections there is not yet any namespace support.

IHMO, this is actually a problem for rocev2.

IB needs more work to create a rdma namespace, but rocve2 does not.

The kernel software side should certainly be completed as a quick
follow on to this series, that means the use of gid_indexes at all
uAPI access points needs to be checked for rocev2.

HW support is needed to complete rocve2 containment, as the hw must
check the gid_index on all directly posted WCs and *ALL* rx'd packets
for a QP to ensure it is allowed.

Some kind of warn on until that support is available would also be
great.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path

2015-12-15 Thread Jason Gunthorpe
On Tue, Dec 15, 2015 at 05:38:34PM +0100, Michael Wang wrote:
> The hop_limit is only suggest that the package allowed to be
> routed, not have to, correct?

If the hop limit is >= 2 (?) then the GRH is mandatory. The
SM will return this information in the PathRecord if it determines a
GRH is required. The whole stack follows this protocol.

The GRH is optional for in-subnet communications.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-15 Thread Jason Gunthorpe
On Tue, Dec 15, 2015 at 03:01:03PM -0500, Hal Rosenstock wrote:
> > I would agree, from my observation, that the two main byte counters
> > are always available.
> 
> The extended packet counts work but I thought there was a PMA with one
> of the extended byte counts wired to 0.

Can't remember

> > But not necessarily the others.
> 
> The unicast/multicast extended counters are not always supported -
> depends on setting of PerfMgt ClassPortInfo
> CapabilityMask.IsExtendedWidthSupportedNoIETF (bit 10).

Yes.. certainly this proposed patch needs to account for that and
continue to use the 32 bit ones in that case.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-15 Thread Jason Gunthorpe
On Tue, Dec 15, 2015 at 04:45:21PM -0500, Doug Ledford wrote:

> In particular, Liran piped up with this comment:
> 
> "Also, I don't want to do any route resolution on the Rx path. A UD QP
> completion just reports the details of the packet it received.
> 
> Conceptually, an incoming packet may not even match an SGID index at
> all.  Maybe, responses should be sent from another device. This should
> not be decided that the point that a packet was received."
> 
> The part that bothers me about this is that this statement makes sense
> when just thinking about the spec, as you say.  However, once you
> consider namespaces, security implications make this statement spec
> compliant, but still unacceptable.

This is where I got to and decided there is no desire to make a
correct patch for this stuff.

You are completely correct in your comments above.

> Since you and Jason did not reach a consensus, I have to dig in and see
> if these patches make it possible to break namespace confinement, either
> accidentally or with intentionally tricky behavior.  That's going to
> take me some time.

Everything to do with parsing a wc and constructing an AH is wrong in
this series, and the fixes require the API changes I mentioned ( add
'wc to gid index' API call, add 'route to AH' API call)

Every time you read 'route validation' - that is an error, the route
should never just be validated, it is needed information to construct
a rocev2 AH. All the places that roughly hand parse the rocev2 WC
should not be open coded.

Even if current HW is broken for namespaces we should not enshrine that
in the kapi.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-11 Thread Jason Gunthorpe
> qib, mlx4 are fine.  mlx5 should be as well I would think (I don't have that
> hardware.)

I have no specifics to add, but I keep running into systems, even
today, where the 64 bit counters don't work. The MAD might be there,
but several counters are wired to 0.

Not sure exactly which HW though.

Mellanox should really confirm this for their hardware matrix.

FWIW, I also hate the sysfs counters that reflect the PMA, these would
be much better are free running, wrapping, non-resetting counters
unrelated to the PMA. Something that doesn't zero after the SM samples
it. Sounds like qib, hif, rdmavt can trivially fix this, and should, IMHO.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-11 Thread Jason Gunthorpe
On Fri, Dec 11, 2015 at 07:23:13PM -0500, ira.weiny wrote:
> On Fri, Dec 11, 2015 at 05:00:47PM -0700, Jason Gunthorpe wrote:
> > 
> > FWIW, I also hate the sysfs counters that reflect the PMA, these would
> > be much better are free running, wrapping, non-resetting counters
> > unrelated to the PMA. Something that doesn't zero after the SM samples
> > it. Sounds like qib, hif, rdmavt can trivially fix this, and should, IMHO.
> 
> To be fair with a 64bit counter these are not going to get reset very often.

It resets when ever the SM sends a reset packet, so 'whenever'

> Furthermore, I don't think we can change the behavior now.

Sure we can, the restting is really a bug, to the point that nothing
can actually use the sysfs counters reliably.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/37] IB/rdmavt: Consolidate dma ops in rdmavt.

2015-12-10 Thread Jason Gunthorpe
On Thu, Dec 10, 2015 at 03:41:12PM -0800, Christoph Hellwig wrote:
> On Thu, Dec 10, 2015 at 10:44:13AM -0700, Jason Gunthorpe wrote:
> > > the signature of the function in struct ib_dma_mapping_ops.
> > 
> > It is supposed to be a dma_addr_t 'cookie' not a u64.
> > 
> > A patch to cleanup the core in this area would be appreciated.
> 
> I walked through the ib_dma_* mess in detail, and sadly speaking it
> has to be a u64.  This is due to the drivers being consolidated into
> rdmavt in fact.

> Those drivers use the addr field in struct ib_sge to point to a kernel
> virtual address, not to a DMA address.  In Linux u64 is the safe
> superset for a dma_addr_t and a pointer so we'll need to go with that.

Hrm.. sizeof(void *) > sizeof(dma_addr_t) seemed pretty obscure to me,
here is the original discussion:

https://lkml.org/lkml/2006/12/13/245

Sounds like someone was worried about sparc64. I doubt it is an actual
issue today, but granted the u64 did make some sense.

I probably would have just banned compiling qib on such a platform,
and kept the dma_addr_t annotations..

> Now these drivers will end up dma mapping these virtual addresses later,
> so we might want figure out a) why the qib & co drivers even need the
> virtual address, and b) see if we maybe should always do the dma_map
> in the callers anyway, and just have an additional virtual address field
> for those drivers if absolutely needed.

So, I *believe* the issue is that linux has (had?) no approved way to
convert from a device specific dma_addr_t to a virtual address.

I always understood that there was some kind of path in qib where it
needed to memcpy with the CPU the 'dma' data, so if it only had a
dma_addr then there was no way to make it work.

Certainly a future generic soft-rocee driver will need to cpu memcpy from
'dma' memory to a skb..

It is really too bad we can't just use get_dma_ops to handle this case
and instead require our own infrastructure.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-10 Thread Jason Gunthorpe
On Thu, Dec 10, 2015 at 09:58:51AM +0200, Moni Shoua wrote:

> > creating a horrible API, or requiring all vendors to implement a
> > network type flag.
> >
> Actually you haven't suggested anything yet besides a name to the function.
> I already said that calculating gid_index from wc and grh requires
> extra CPU cycles and is prone to mistakes. But, I might be wrong and
> you have a better idea. Do you?

I told you, if a driver requires vendor specific information then
include it as private data in the grh.

I keep saying this: computing the gid index is not optional. If a
drive can't do it efficiently then too bad, it has to go the long way
around.

>> I think mlx made a big mistake returning network_type instead of gid
>> index, and I don't want to see that error enshrined in our API.
>>
> returning gid_index is wrong because it forces CQ pollers to be aware
> of the entire table. Like I already mentioned, the GID table is a HW
> resource that can be divided and handed to multiple VMs,

Please. If HW virt stuff can't sort this out it is broken, there are
many trivial solutions.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-10 Thread Jason Gunthorpe
On Thu, Dec 10, 2015 at 11:07:37AM -0500, Chuck Lever wrote:
 
> Invasive IB core changes like this clean up are especially
> burdensome for me because NFS/RDMA changes do not normally
> go through Doug's tree, so it takes extra co-ordination.

The ARM folks do this sort of stuff on a regular basis.. Very early on
Doug prepares a topic branch with only the big change, NFS folks pull
it and then pull your work. Then Doug would send the topic branch to
Linus as soon as the merge window opens, then NFS would send theirs.

This is alot less work overall than trying to sequence multiple
patches over multiple releases..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/37] IB/rdmavt: Consolidate dma ops in rdmavt.

2015-12-10 Thread Jason Gunthorpe
On Thu, Dec 10, 2015 at 11:17:09AM -0500, Dennis Dalessandro wrote:
> On Tue, Dec 08, 2015 at 08:08:21AM +0200, Leon Romanovsky wrote:
> >On Mon, Dec 07, 2015 at 03:43:06PM -0500, Dennis Dalessandro wrote:
> >>+
> >>+#define BAD_DMA_ADDRESS ((u64)0)

Just use DMA_ERROR_CODE.

> >>+static u64 rvt_dma_map_single(struct ib_device *dev, void *cpu_addr,
> >>+ size_t size, enum dma_data_direction direction)
> >>+{
> >>+   if (WARN_ON(!valid_dma_direction(direction)))
> >>+   return BAD_DMA_ADDRESS;
> >>+
> >>+   return (u64)cpu_addr;
> >>+}
> >An example of such function.
> 
> Honestly I'm not really sure why it's done this way. We are just following
> the signature of the function in struct ib_dma_mapping_ops.

It is supposed to be a dma_addr_t 'cookie' not a u64.

A patch to cleanup the core in this area would be appreciated.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/37] IB/rdmavt: Consolidate dma ops in rdmavt.

2015-12-10 Thread Jason Gunthorpe
On Thu, Dec 10, 2015 at 04:46:24PM -0800, Christoph Hellwig wrote:
> On Thu, Dec 10, 2015 at 05:29:50PM -0700, Jason Gunthorpe wrote:
> > Hrm.. sizeof(void *) > sizeof(dma_addr_t) seemed pretty obscure to me,
> > here is the original discussion:
> > 
> > https://lkml.org/lkml/2006/12/13/245
> > 
> > Sounds like someone was worried about sparc64. I doubt it is an actual
> > issue today, but granted the u64 did make some sense.
> 
> sparc64 still uses a 32-bit dma_addr_t.

Hmm, too bad, that choice severly compromises the rdma userspace -
user space can't create mrs that exceed 4G in total :(

> > So, I *believe* the issue is that linux has (had?) no approved way to
> > convert from a device specific dma_addr_t to a virtual address.
> 
> Linux doesn't have an approved way because it's impossible for the
> generic case.  When you have an iommu you have potentially multiple
> page tables mapping physical addresses to device virtual addresses,
> and there is no easy way to do a reverse mapping.

Yes, I know

> > It is really too bad we can't just use get_dma_ops to handle this case
> > and instead require our own infrastructure.
> 
> FYI, I have a patch series in linux-next to switches all remaining
> architectures to use get_dma_ops, and there are plans to allow generic
> per-device dma_ops based on that.

Great, so once that is merged we can drop the ib_* versions of all
this and just have qib/etc customize get_dma_ops? Other than the
dma_addr_t size issue that sounds great..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-10 Thread Jason Gunthorpe
On Thu, Dec 10, 2015 at 11:56:50PM -0500, Doug Ledford wrote:

> Looking at struct netdevice, it has the sort of organization I would
> call reasonable.  Things like struct tx_stats is a struct, even though
> it's embedded in the parent struct and not a pointer and there is
> exactly and only one copy, so it could be flat.

struct net_device_stats exists because it is used in many more places
that just in struct net_device, so this is a code sharing thing.

> something I would call well organized.  And speaking of that, in struct
> netdevice, all of the various ops elements are handled as structs,

.. and 'struct net_device_ops' is used extensively.

Whereas once we get rid of the query call ib_device_attr has no second
user.

> including the base netdevice ops, whereas struct ib_device puts the base
> ops flat in the struct.  So if we wanted to be more like struct
> netdevice, we would move the base ops out of struct ib_device.

It would make the drivers a bit nicer if they initialized an ib_ops
structure like netdev does instead of in code, but performance wise
this is probably better, especially if we sorted the struct members
sanely..

> out like this.  An ib_device can be multi-port, and each port can be a
> different RDMA device type.

I thought we figured out that wasn't actually allowed today when
working on the caps rework?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] Ibacm: default pkey for partitioned fabrics

2015-12-09 Thread Jason Gunthorpe
On Wed, Dec 09, 2015 at 07:51:46AM -0500, Hal Rosenstock wrote:
> > ipoib always uses the 0 pkey index to create the default ipoib
> > interface. (see eg, update_parent_pkey)
> 
> This is beyond IBA spec and is currently a linux convention for IPoIB.
> IMO it should be changed to search for this pkey rather than assume
> it's

I don't think you are following. It uses pkey[0] as the pkey for
ipoib, not necessarily for SA communication.

Since there is no way to know what the desired pkey is for ipoib there
is no possibility to search. Using pkey index is 0 a good solution
since it allows the SM to configure ipoib defaults centrally.

> > When operating securely the SA should place the pkey for default ipoib
> > operation in pkey index 0, and place 0x7FFF in another index. I run
> > alot of networks exactly like this and it works very well.
> 
> Yes, it can run that way but more secure is without the full default
> pkey. When full default pkey is in every port, the rest of the
> partitioning doesn't really matter...

That isn't what I said, I said the pkey for the default ipoib
interface is in pkey[0], eg, the network runs with [0x8001,0x7FFF] as
the pkey table. There is no 0x pkey except on SA nodes.

Linux automatically creates ib0 on 0x8001 and the rest of the
in-kernel stack (should?) correctly find and use 0x7FFF as the pkey to
use to talk to the SA.

acm should follow ipoib convention for creating it's multicast groups
and setup it's default multicast group using pkey[0]

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] Ibacm: default pkey for partitioned fabrics

2015-12-09 Thread Jason Gunthorpe
On Wed, Dec 09, 2015 at 01:07:14PM +, Wan, Kaike wrote:

> > > + /* Determine the default pkey index for SA access first.
> > > +  *   Order of preference: 0x, 0x7fff, first pkey.
> > 
> > No, IBA says that only the default pkey should be used to talk to the SA,
> > every port needs 0x7FFF or the full mebership version. Do not search for the
> > first pkey.
> 
> We use the first pkey only if there is neither 0x7fff nor 0x in
> this port. If the port is in compliance with IB Spec, then we will
> be using either 0x or 0x7fff for SA access.

This is just confusing for readers, the IBA spec is very clear on what
pkey must be used to talk to the SA, don't ever use something
else. Follow the spec.

> > > +  * Determine the default pkey for parsing address file as well.
> > > +  *   order of preference: first full-member non-management pkey,
> > > +  *   0x, first pkey.
> > > +  */
> > 
> > This really should just be the 0 index pkey, which exactly matches how IPoIB
> > determines the default pkey, which is what matters when talking rdmacm..
> 
> It is true in most default configurations. However, since ibacm will
> use the default pkey for multicast, we want to make sure that it
> will not use a limited-member pkey to create/join a multicast group
> (practically of little use in this case) if such a pkey is placed at
> index 0.

If you don't follow the exact ipoib algorithm then you get different
answers in some cases and ugly subtle failure modes. Ie this algorithm
will not choose 0x as the pkey in cases where ipoib would - which
is not acceptable, IMHO.

If the 0 index pkey is not usable for ipoib then ipoib will be broken
too and that is far more likely to be noticed than if acm is broken.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC contig pages support 1/2] IB: Supports contiguous memory operations

2015-12-09 Thread Jason Gunthorpe
On Wed, Dec 09, 2015 at 10:00:02AM +, Shachar Raindel wrote:
> > Yes please.

> Note that other HW vendors are developing similar solutions, see for
> example:
> http://www.slideshare.net/linaroorg/hkg15106-replacing-cmem-meeting-tis-soc-shared-buffer-allocation-management-and-address-translation-requirements

CMA and it's successors are for something totally different.

> > We already have huge page mmaps, how much win is had by going from
> > huge page maps to this contiguous map?
> 
> As far as gain is concerned, we are seeing gains in two cases here:
> 1. If the system has lots of non-fragmented, free memory, you can
> create large contig blocks that are above the CPU huge page size.
> 2. If the system memory is very fragmented, you cannot allocate huge
> pages. However, an API that allows you to create small (i.e. 64KB,
> 128KB, etc.) contig blocks reduces the load on the HW page tables
> and caches.

I understand what it does, I was looking for performance numbers. The
last time I trivially benchmarked huge pages vs not huge pages on mlx4
I wasn't able to detect a performance difference.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-09 Thread Jason Gunthorpe
On Wed, Dec 09, 2015 at 11:34:10AM +0200, Moni Shoua wrote:
> > Eh? I think you've missed the point, there is no net device when
> > looking at a wc.
> >
> > Look, here is a concrete direction:
> >
> > Replace all the crap in
> > ib_init_ah_from_wc/get_sgid_index_from_eth/rdma_addr_find_dmac_by_grh
> >
> > with a straightforward
> >
> >rdma_dgid_index_from_wc(
> > const struct ib_qp *qp,
> > const struct ib_wc *wc,
> > const struct ib_grh *grh,
> > u16 *gid_index)
> >
> > Sort of function that reads the GRH and wc and returns the unambiguous
> > gid index that was used to receive that packet on the UD QP.
> >
> 
> I already answered this to but I'll do it again
> RoCEv2 spec says that L3 header will be scattered to receive WQE in
> the following way
> IPv6 and RoCEv1 - 40 bytes of the L3 header (GRH or IPv6) to the first
> 40 bytes of the receive bufs
> IPv4 - 20 bytes of the L3 header to the second half of the first 40
> bytes of the receive bufs. The first 20 bytes remain undefined.
> 
> Now, if you think how you deduce network_type from GRH you'll see that
> it requires tools like checksum validation and other validations and
> you end up with a method that is not 100% error free. So,to eliminate
> the need for heavy computation (with regards to the other option) and
> be free from false deductions you have the option of getting
> network_type from the hardware. So, if you do have hardware that
> supports it why give it up?

I understand how it works.

>From an API perspective, I want to see something that can be used
*correctly* and that means more along the lines of
rdma_dgid_index_from_wc and not what is proposed in this series.

That means not exposing the callers to this awful mess.

> > That said, I wouldn't object to vendor-specific bits in the wc. Ie if
> > mlx hardware needs a network_type bit to implement
> > rdma_find_dgid_index_from_wc, then fine - define a vendor specific
> > place to put it. In this case rdma_find_dgid_index_from_wc would be a
> > driver call back, which is fine, and what Caitlin was talking about.
> 
> This is not a Mellanox specific flag. See a quote from the spec
>
> A17.4.5.1 UD COMPLETION QUEUE ENTRIES (CQES)
> For UD, the Completion Queue Entry (CQE) includes remote address
> information (InfiniBand Specification Vol. 1 Rev 1.2.1 Section
> 11.4.2.1). For RoCEv2, the remote address information comprises the
> source L2 Address and a flag that indicates if the received frame is
> an IPv4, IPv6 or RoCE packet.

rdma_dgid_index_from_wc satisfies the above spec requirements without
creating a horrible API, or requiring all vendors to implement a
network type flag.

> > But, it is not part of our verbs API, and I'd *strongly* encourage
> > other vendors and future hardware to simply return the gid index that
> > the hardware matched instead of requiring the software to try and
> > guess after the fact.
> 
> Could be problematic for virtual machine architectures that give a
> portion of the entire GID table to a VM that index it 0..N

I doubt it.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-09 Thread Jason Gunthorpe
On Wed, Dec 09, 2015 at 09:38:21AM +, Liran Liss wrote:

> > But, it is not part of our verbs API, and I'd *strongly* encourage other
> > vendors and future hardware to simply return the gid index that the
> > hardware matched instead of requiring the software to try and guess after
> > the fact.
 
> You are pushing abstraction into provider code instead of handling it in a 
> generic way.

No, I am defining an API that *make sense* and doesn't leak useless
details. Of course that doesn't force code duplication or anyhting
like that, just implement it smartly.

I think mlx made a big mistake returning network_type instead of gid
index, and I don't want to see that error enshrined in our API.

> The Verbs are a low-level API, that should report exactly what was
> received from the wire.  In the RoCEv2 case, it should be the GID/IP
> addresses and the protocol type.  The addressing information is not
> intended to be used directly by applications; it is the raw bits
> that were accepted from the wire.

Low level details isn't what any in kernel consumer needs. Everything
in kernel needs the gid index to determine the namespace, routing and
other details. It is not optional. A common API is thus needed to do
this conversion.

API-wise, once you get the gid index then it is trivial to make easy
extractors for everything else. ie for example:
rdma_get_ud_src_sockaddr(gid_index,,wc,grh)
rdma_get_ud_dst_sockaddr(gid_index,,wc,grh)

> ib_init_ah_from_wc() and friends is exactly the place that you want
> to create an address handle based on completion and packet fields.

CMA needs exactly the same logic as well, the fact it doesn't have it
is a bug in this series.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next 1/2] net/mlx5_core: Configure HW to support atomic request in host endianness

2015-12-08 Thread Jason Gunthorpe
On Tue, Dec 08, 2015 at 08:08:02PM +0200, eran ben elisha wrote:
> On Tue, Dec 8, 2015 at 7:17 PM, Jason Gunthorpe
> <jguntho...@obsidianresearch.com> wrote:
> > On Tue, Dec 08, 2015 at 06:54:15PM +0200, Eran Ben Elisha wrote:
> >> HW is capable of 2 requestor endianness modes for standard 8 Bytes
> >> atomic: BE (0x0) and host endianness (0x1). Read the supported modes
> >> from hca atomic capabilities and configure HW to host endianness mode if
> >> supported.
> >
> > Uh, isn't this a user visible thing?
> >
> > How will apps know if they need to swap or not?
> >
> > Jason
> 
> The problem is that some HWs cannot support IB spec regarding
> requester respond endianness.
> By default, HWs are set to return BE respond.
> If it can be configured to host endianness -> return IB_ATOMIC_HCA
> if not -> return IB_ATOMIC_NONE
> (Done in patch #2 in this series)
> 
> The state of BE respond is not visible to the user (atomic cap =
> IB_ATOMIC_NONE),
> App should behave according to the existing API and IB spec.

Okay, that sounds fine, details like that belong in the patch comments.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC contig pages support 1/2] IB: Supports contiguous memory operations

2015-12-08 Thread Jason Gunthorpe
On Tue, Dec 08, 2015 at 07:18:52AM -0800, Christoph Hellwig wrote:
> There is absolutely nothing IB specific here.  If you want to support
> anonymous mmaps to allocate large contiguous pages work with the MM
> folks on providing that in a generic fashion.

Yes please.

We already have huge page mmaps, how much win is had by going from
huge page maps to this contiguous map?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next 1/2] net/mlx5_core: Configure HW to support atomic request in host endianness

2015-12-08 Thread Jason Gunthorpe
On Tue, Dec 08, 2015 at 06:54:15PM +0200, Eran Ben Elisha wrote:
> HW is capable of 2 requestor endianness modes for standard 8 Bytes
> atomic: BE (0x0) and host endianness (0x1). Read the supported modes
> from hca atomic capabilities and configure HW to host endianness mode if
> supported.

Uh, isn't this a user visible thing?

How will apps know if they need to swap or not?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/10] IB/qib: Use rdmavt lid defines in qib

2015-12-08 Thread Jason Gunthorpe
On Tue, Dec 08, 2015 at 02:24:55PM -0500, ira.weiny wrote:
> On Mon, Dec 07, 2015 at 01:54:39PM -0700, Jason Gunthorpe wrote:
> > On Mon, Dec 07, 2015 at 03:49:12PM -0500, Dennis Dalessandro wrote:
> > >   /* A multicast address requires a GRH (see ch. 8.4.1). */
> > > - if (ah_attr->dlid >= QIB_MULTICAST_LID_BASE &&
> > > - ah_attr->dlid != QIB_PERMISSIVE_LID &&
> > > + if (ah_attr->dlid >= be16_to_cpu(IB_MULTICAST_LID_BASE) &&
> > > + ah_attr->dlid != be16_to_cpu(IB_LID_PERMISSIVE) &&
> > 
> > Uh cpu_to_be16 please..
> 
> But, the defines are big endian and the dlid here is cpu endian.

Hurm, I can't even find the patch that adds IB_MULTICAST_LID_BASE..

But I believe you, and think that is pretty gross to have the constant
be BE when typical uses like this are CPU..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-08 Thread Jason Gunthorpe
On Wed, Dec 09, 2015 at 12:47:55AM +0200, Or Gerlitz wrote:
> On Wed, Dec 9, 2015 at 12:04 AM, Doug Ledford  wrote:
> > Makes sense.
> 
> thanks.
> 
> > Show me what you are talking about (either a link to Ira's
> > patch you are referring to or your own patch).
> 
> The patch is three liner to add the cached attrs --
> http://marc.info/?l=linux-rdma=142309296813985=2 -- if you are OK
> with that, I will add a 2nd patch that ports all ULPs to use the
> cached copy instead of their code which does the query.
> 
> Actually, why not start with this approach and later decide if we need
> to go further of this is enough?

Or, can we please stop this bikeshedding. Christoph's patch is done,
well tested and does a lot more clean up that this hacky three liner.

It is a good patch, and although patchworks doesn't have my remarks
from an earlier revision I still think it should go forward. 

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/10] IB/qib: Use rdmavt lid defines in qib

2015-12-08 Thread Jason Gunthorpe
On Tue, Dec 08, 2015 at 06:19:41PM -0500, Dennis Dalessandro wrote:

> So what is the consensus here? Should we leave it alone for now and
> potentially go back and deal with this separately?  Just define the new one
> as LE and use it, even though it doesn't match the rest?  Something else
> entirely?

I'm okay if you stick with what you have, there are too many other
cases of byteswapped constants to get too excited I guess. :(

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-08 Thread Jason Gunthorpe
On Tue, Dec 08, 2015 at 09:23:18AM +0200, Moni Shoua wrote:
> >> Now, when same gid value can be present in multiple entries you need
> >> an extra parameter to get distinguish between them - that would be
> >> the netwrok_type
> >
> > Eh?  You are only worried about duplicates between rocev1 mode which
> > uses the GUID and v2 mode which uses the host's IP address?
> >
> > What about duplicates entirely within v2 mode where vlan and macvlan
> > can create duplicate host IP addresses.

> In that case these entries will be distinguished  byt its eth devices
> (eth device is part of the gid attributes structure)

Eh? I think you've missed the point, there is no net device when
looking at a wc.

Look, here is a concrete direction:

Replace all the crap in
ib_init_ah_from_wc/get_sgid_index_from_eth/rdma_addr_find_dmac_by_grh

with a straightforward

   rdma_dgid_index_from_wc(
const struct ib_qp *qp,
const struct ib_wc *wc,
const struct ib_grh *grh,
u16 *gid_index)

Sort of function that reads the GRH and wc and returns the unambiguous
gid index that was used to receive that packet on the UD QP.

The gid cache is not allowed to create an ambiguity the driver cannot
resolve.

That said, I wouldn't object to vendor-specific bits in the wc. Ie if
mlx hardware needs a network_type bit to implement
rdma_find_dgid_index_from_wc, then fine - define a vendor specific
place to put it. In this case rdma_find_dgid_index_from_wc would be a
driver call back, which is fine, and what Caitlin was talking about.

But, it is not part of our verbs API, and I'd *strongly* encourage
other vendors and future hardware to simply return the gid index that
the hardware matched instead of requiring the software to try and
guess after the fact.

This is the same issue/argument we went around and around on the
lladdr ipoib details when working on the namespace patches, about how
important it is to resolve the namespace from the hardware headers.

Of course once we have the gid index we now have the net device and
other information needed to make namespaces work.

.. and this is part of what I mean what I said the interface from the
gid cache code is not a sane API and needs to be changed.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-08 Thread Jason Gunthorpe
On Tue, Dec 08, 2015 at 03:02:44PM -0800, Christoph Hellwig wrote:
> On Tue, Dec 08, 2015 at 03:59:40PM -0700, Jason Gunthorpe wrote:
> > Or, can we please stop this bikeshedding. Christoph's patch is done,
> > well tested and does a lot more clean up that this hacky three liner.
> > 
> > It is a good patch, and although patchworks doesn't have my remarks
> > from an earlier revision I still think it should go forward. 
> 
> While I'd prefer the version Or points to over not having anything
> at all I'd much prefer sorting it properly and make the RDMA code
> behave like all other Linux subsystems.

*shrug* as long as we get the purge of ib_query_device and the
assocaited migration of the driver code to the register function, I
don't care very much what the names are either. device->max_map_per_fmr vs
device->attr.max_map_per_fmr is just bikeshedding.

> Jason, can you give me a formal ACK'ed by and I'll rebase it on top
> of the current 4.4 queue so we could start the 4.5 window with it.

I looked over the version in patchworks again, looks OK to me:

Reviewed-by: Jason Gunthorpe <jguntho...@obsidianresearch.com>

Thanks,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] Ibacm: default pkey for partitioned fabrics

2015-12-08 Thread Jason Gunthorpe
On Tue, Dec 08, 2015 at 12:33:02PM -0500, kaike@intel.com wrote:
> From: Kaike Wan 
> 
> In an insecure IB fabric, the default pkey in a port is 0x, where each
> node is allowed to talk to any other node in the fabric, including the SA
> node. However, in a secure fabric, to limit member access, not all nodes
> can have the full-member default pkey 0x. A typical configuration is
> to let SA node have pkey 0x while all other nodes have pkey 0x7fff; in
> addition, each node can be assigned some other full-member pkeys, such as
> 0x8001 and 0x8002, so that it can be assigned to different partitions.
> In this case, each node can access SA, and yet limits its other access to
> only those nodes in its assigned partitions. In such a secure fabric,
> however, ibacm will not work by interpreting "default" in its default
> address file as 0x.

ipoib always uses the 0 pkey index to create the default ipoib
interface. (see eg, update_parent_pkey)

When operating securely the SA should place the pkey for default ipoib
operation in pkey index 0, and place 0x7FFF in another index. I run
alot of networks exactly like this and it works very well.

This ensures that ipoib works out of the box without additional
configuration.

> + /* Determine the default pkey index for SA access first.
> +  *   Order of preference: 0x, 0x7fff, first pkey.

No, IBA says that only the default pkey should be used to talk to the
SA, every port needs 0x7FFF or the full mebership version. Do not
search for the first pkey.

> +  * Determine the default pkey for parsing address file as well.
> +  *   order of preference: first full-member non-management pkey,
> +  *   0x, first pkey.
> +  */

This really should just be the 0 index pkey, which exactly matches how
IPoIB determines the default pkey, which is what matters when talking
rdmacm..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-07 Thread Jason Gunthorpe
On Mon, Dec 07, 2015 at 08:37:41AM +0200, Moni Shoua wrote:
> On Mon, Dec 7, 2015 at 8:34 AM, Jason Gunthorpe
> <jguntho...@obsidianresearch.com> wrote:
> > On Mon, Dec 07, 2015 at 08:15:40AM +0200, Moni Shoua wrote:
> >
> >> What you have though is the sgid taken from the GRH that is scattered
> >> to the first 40/20 bytes of the receive WQE.  This is not enough to
> >> determine the network type.
> >
> > It is enough to discover the sgid index which will tell you the type.
> but how? all you have in hand is the sgid which can appear several
> times in the GID table in different indices.

Eh? Reliably recovering the gid index is not optional. The network
namespace stuff that is already in the kernel hard requires that
ability. (This is the same argument we went around on already why pkey
had to come from the wc, not from the payload)

If your position is it cannot be done from a WC,QP as is, then
gid_index needs to be added to the wc, or something else to remove the
ambiguity.

In either case, network_type is absolutely the wrong thing to have in
the wc.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/10] IB/qib: Use rdmavt lid defines in qib

2015-12-07 Thread Jason Gunthorpe
On Mon, Dec 07, 2015 at 03:49:12PM -0500, Dennis Dalessandro wrote:
>   /* A multicast address requires a GRH (see ch. 8.4.1). */
> - if (ah_attr->dlid >= QIB_MULTICAST_LID_BASE &&
> - ah_attr->dlid != QIB_PERMISSIVE_LID &&
> + if (ah_attr->dlid >= be16_to_cpu(IB_MULTICAST_LID_BASE) &&
> + ah_attr->dlid != be16_to_cpu(IB_LID_PERMISSIVE) &&

Uh cpu_to_be16 please..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 06/37] IB/rdmavt: Add query and modify device stubs

2015-12-07 Thread Jason Gunthorpe
On Mon, Dec 07, 2015 at 09:26:11PM +, Hefty, Sean wrote:
> > +static int rvt_query_device(struct ib_device *ibdev,
> > +   struct ib_device_attr *props,
> > +   struct ib_udata *uhw)
> > +{
> > +   /*
> > +* Return rvt_dev_info.props contents
> > +*/
> > +   return -EINVAL;
> 
> ENOSYS on all of the function templates.  This and subsequent patches.

We recently had a long discussion about what the correct answer here
is.

ENOSYS and EINVAL are both wrong for different reasons.. Can't recall
if something else was settled on? I think I was suggesting one of the
ENOTSUP varients?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-07 Thread Jason Gunthorpe
On Mon, Dec 07, 2015 at 08:34:43PM +0200, Moni Shoua wrote:
> Well, just tell me how you want to discover gid_index when you poll
> the WC out of the CQ.

Hey, I'm not desiging this rocev2 stuff, this is something the rocev2
community needs to sort out.

> Like I said, the sgid itself is in the GRH that is scattered to the
> buffers in the receive queue. When ib_poll_cq() is called the pointer
> to GRH is not passed so there is no way to determine the gid_index
> inside the polling context.

Then have an API to let the consumer recover it.

> BTW, why do you argue about this only now? This is not RoCE specific
> issue. sgid_index was always resolved outside the polling context. The
> only difference between then and now is that with InfiniBand there was
> no conflict about sgid_index once you had the sgid.

I think you answered your own question. In IB and rocev1 the sgid
index is not ambiguous, rocev2 breaks that requirement without
adaquately fixing it.

It is probably also the case that rocev1 has similar problems,
depending on how vland and macvlan ended up being implemented.

> Now, when same gid value can be present in multiple entries you need
> an extra parameter to get distinguish between them - that would be
> the netwrok_type

Eh?  You are only worried about duplicates between rocev1 mode which
uses the GUID and v2 mode which uses the host's IP address?

What about duplicates entirely within v2 mode where vlan and macvlan
can create duplicate host IP addresses.

You can also just not create duplicate entries in the first place. For
instance there probably isn't a really a good reason to use rocev2 for
IPv6 link local addreses, that trivially eliminates the v1/v2
collision.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-06 Thread Jason Gunthorpe
On Mon, Dec 07, 2015 at 08:15:40AM +0200, Moni Shoua wrote:

> What you have though is the sgid taken from the GRH that is scattered
> to the first 40/20 bytes of the receive WQE.  This is not enough to
> determine the network type.

It is enough to discover the sgid index which will tell you the type.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-03 Thread Jason Gunthorpe
On Thu, Dec 03, 2015 at 03:47:12PM +0200, Matan Barak wrote:
> From: Somnath Kotur 
> 
> Providers should tell IB core the wc's network type.
> This is used in order to search for the proper GID in the
> GID table. When using HCAs that can't provide this info,
> IB core tries to deep examine the packet and extract
> the GID type by itself.

Eh? A wc has a sgid_index, and in this brave new world a gid has the
network type. Why do we need to specify it again?

>   memset(ah_attr, 0, sizeof *ah_attr);
>   if (rdma_cap_eth_ah(device, port_num)) {
> + if (wc->wc_flags & IB_WC_WITH_NETWORK_HDR_TYPE)
> + net_type = wc->network_hdr_type;
> + else
> + net_type = ib_get_net_type_by_grh(device, port_num, 
> grh);
> + gid_type = ib_network_to_gid_type(net_type);

Like here for instance.

... and I keep saying this is all wrong, once you get into IP land
this entire process needs a route/neighbour lookup.


> - ret = rdma_addr_find_dmac_by_grh(>dgid, >sgid,
> + ret = rdma_addr_find_dmac_by_grh(, ,
>ah_attr->dmac,
>wc->wc_flags & 
> IB_WC_WITH_VLAN ?
>NULL : _id,

ie no to this.

> + if (sgid_attr.gid_type == IB_GID_TYPE_ROCE_UDP_ENCAP)
> + /* TODO: get the hoplimit from the inet/inet6
> +  * device
> +  */

And no again, please fix this and all other missing route lookups
before sending another version.

> + struct {
> + /* The IB spec states that if it's IPv4, the header

roceev2 spec, surely

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: stalled again

2015-12-02 Thread Jason Gunthorpe
On Wed, Dec 02, 2015 at 04:29:27AM -0800, Christoph Hellwig wrote:
> Maybe we need more co-maintainers if the two current maintainers can't
> handle the workload?

Doug is the only maintainer. The idea of co-maintainers was rejected
by some parties.

Don't be confused, the other 'M' people in the MAINTAINERS file are
really doing the 'R' job.

If Doug wasn't able to become full time on this project we should
probably revisit that.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 04/17] staging/rdma/hfi1: Fix qp.h comments

2015-12-01 Thread Jason Gunthorpe
On Tue, Dec 01, 2015 at 03:38:13PM -0500, Jubin John wrote:
> From: Kaike Wan 
> 
> This patch fixes a few incorrect header file comments in qp.h

FWIW, within drivers/infiniband we've been moving these comments
to the implementation, not the function prototype.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V1 5/9] IB/core: Add rdma_network_type to wc

2015-11-25 Thread Jason Gunthorpe
On Wed, Nov 25, 2015 at 04:18:25PM +0200, Matan Barak wrote:
> On Wed, Nov 25, 2015 at 8:55 AM, Jason Gunthorpe
> <jguntho...@obsidianresearch.com> wrote:
> > On Tue, Nov 24, 2015 at 09:07:41PM +0200, Matan Barak wrote:
> >
> >> IMHO, the user is entitles to choose any valid sgid_index for the
> >> interface. Anything he chooses guaranteed to be valid (from security
> >> perspective)
> >
> > No, the namespace patches will have to limit the sgid_indexes that can
> > be used with a QP to those that fall within the namespace. This is
> > another reason I don't like this approach for the kapi.
> 
> By saying namespace, do you mean net namespaces?

Whatever it turns out to be, Haggie was talking about rdma namespaces
for some for this stuff too, but IMHO, rocev2 is pretty clearly
covered under net namespaces.

> If so, the gid cache allows to search by net device (and there's a
> "custom" search that the user can define a filter function which can
> filter by net).
> Anyway, I don't think this cache should be used other than a simple database.

It has nothing to do with the cache, it is everywhere else, you can't
create a qp with a sgid index that is not part of your namespace, for
instance, or recieve a packet on a QP outside your namespace,
etc. Lots of details.



> >> Why do we need to block users who use ibv_rc_pingpong and chose the
> >> GID index correctly by hand?
> >
> > I'm not really concerned with user space, we are stuck with exporting
> > the gid index there.
> 
> So why do we need to block kernel applications from doing the same
> things user-space application can do?

As I explained, it is never correct to use a naked sgid_index and
roceve2, uverbs can't be fixed without a uapi change, but the kernel
can be.

> If there are kernel consumers that want to work with verbs directly,
> they should use ib_init_ah_from_wc and ib_resolve_eth_dmac (or we
> can

As I already said these functions are wrong, they don't have the
routing lookup needed for rocev2. That is my whole point, the
functions that are using the gid cache for rocev2 are *not correct*

I don't really care how you fix it, but every rocev2 sgid-index lookup
in the kernel must be accompanied by a route lookup.

I think the gid cache API design is wrong here because it doesn't
force the above, but whatever, if you choose a different API it
becomes your job to review every patch from now own to make sure other
people use your dangerous API properly.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V1 9/9] IB/cma: Join and leave multicast groups with IGMP

2015-11-25 Thread Jason Gunthorpe
On Wed, Nov 25, 2015 at 10:31:15AM +0200, Moni Shoua wrote:
> On Tue, Nov 24, 2015 at 8:15 PM, Jason Gunthorpe
> <jguntho...@obsidianresearch.com> wrote:
> > On Tue, Nov 24, 2015 at 11:41:10AM +0200, Moni Shoua wrote:
> >> On Mon, Nov 23, 2015 at 11:25 PM, Jason Gunthorpe
> >> <jguntho...@obsidianresearch.com> wrote:
> >> > On Thu, Oct 15, 2015 at 07:07:12PM +0300, Matan Barak wrote:
> >> >> diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h
> >> >> index 0a40ed2..5bea0e8 100644
> >> >> +++ b/include/rdma/ib_sa.h
> >> >> @@ -206,6 +206,9 @@ struct ib_sa_mcmember_rec {
> >> >>   u8   scope;
> >> >>   u8   join_state;
> >> >>   int  proxy_join;
> >> >> + int  ifindex;
> >> >> + struct net  *net;
> >> >> + enum ib_gid_type gid_type;
> >> >>  };
> >> >
> >> > This is really gross.
> >> >
> >> > Make ib_init_ah_from_mcmember accept a QP and get the above stuff from
> >> > the QP.
> >> >
> >> > Jason
> >>
> >> Which QP is that. You might not have any existing QP when you want to
> >> create the AH or you might have 10.
> >
> > roce multicast is only done with the CM and the CM always has a QP.
> >
> I don't see why you can't join before having a QP and anyway,
> rdma_create_qp() is optional

Ugh, gross, why would anyone want to do that..

Doesn't change my point, the CM id is bound before multicast join can
run, don't pollute ib_sa_mcmember_rec with this.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR

2015-11-25 Thread Jason Gunthorpe
On Wed, Nov 25, 2015 at 09:09:20AM -0800, santosh shilimkar wrote:
> >I'd say drop the current iWarp transport if it's not testable.  The
> >only real difference between IB and iWarp is the needed to create
> >a MR for the RDMA READ sink, and we're much better of adding that into
> >the current IB transport if new iWarp users show up.
> 
> Agree. I will probably do it along with RDS IB FR support in 4.6

Great news!

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/9] IB: add a proper completion queue abstraction

2015-11-24 Thread Jason Gunthorpe
On Tue, Nov 24, 2015 at 06:47:14PM -0800, Caitlin Bestler wrote:
> Acknowledge packet for the last packet of the request has been
> committed to the wire (including the appropriate fields for RDMA
> READ response).

> The target side cannot generate a response before it receives the request.
> The target side verbs cannot generate the completion before the acknowledge
> packet has been
> committed to the wire.

Sure, but, I keep saying this, the responder behavior is largely
irrelevant to what the target is able/required to do.

> Therefore the initiating side cannot receive a response before the write
> operation has completed.

Wrong. The ladder diagram would be

Requestor  Responder   Responder Verbs

SEND1 --->   Process
 X-  ACK (lost)
 > recv1 CQ
 <---   send2 WQE
recv2 CQE <  SEND2 Packet
[..]
send1 CQE <  ACK (resent)

The Ack may be lost, causing the send CQE to arrive after the recv
CQE, even though the responder did everything in a specific order.

The fundamental issue is that the responder cannot detect the lost
ACK. The PSN of the ACK packet is part of the Requestor's PSN space,
not part of the Responders:

 9.7.5.1.1 GENERATING PSNs FOR ACKNOWLEDGE MESSAGES

 C9-95: For responses to SEND requests or RDMA WRITE requests the
  responder shall insert in the PSN field of the response the PSN of the
  most recent request packet being acknowledged.

Or stated another way, the value of the AckReq bit in SEND1 has no
impact on the contents of the SEND2 packet - thus there is no way for
the requestor to detect the loss of the ACK and hold off delivering
recv2 CQE.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V1 5/9] IB/core: Add rdma_network_type to wc

2015-11-24 Thread Jason Gunthorpe
On Tue, Nov 24, 2015 at 09:07:41PM +0200, Matan Barak wrote:

> IMHO, the user is entitles to choose any valid sgid_index for the
> interface. Anything he chooses guaranteed to be valid (from security
> perspective)

No, the namespace patches will have to limit the sgid_indexes that can
be used with a QP to those that fall within the namespace. This is
another reason I don't like this approach for the kapi.

> Why do we need to block users who use ibv_rc_pingpong and chose the
> GID index correctly by hand?

I'm not really concerned with user space, we are stuck with exporting
the gid index there.

> > OK. Change the gid cache so only a RDMA CM private API can return
> > rocev2 gids.
> 
> So you propose to block verbs applications from using the RoCE v2 GIDs? Why?

Just the kernel consumers, so the in-kernel users are correct.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/9] IB: add a proper completion queue abstraction

2015-11-24 Thread Jason Gunthorpe
On Tue, Nov 24, 2015 at 10:08:39AM -0700, c...@asomi.com wrote:
>>> My recollection of the IB verbs is that they were unlikely to have
>>> overlooked something like that. If it did slip through then there
>>> should be an errata.
>>verbs reflects the wire protocol and the wire protocol has no way to
>>create a linkage between the send and recv sides of a RC QP. It is not
>>a spec bug, there is no errata.
> 
>No, verbs do not reflect the link layer. Verbs fulfill the contract
>with the end user - especially when the hardware cannot.

Are you serious? You think verbs should specify things that cannot
actually be implemented? IB verbs doesn't do that.

>Reporting a completion of a receive without having guaranteed
>that the acknowledgement will be sent before anything
>subsequently submitted is failing to implement a reliable
>connection.

As I've said repeatedly, the side receiving the request has no impact
on this issue. Requiring it to generate an ack before delivering the
recv completion *does nothing* to guarantee ordering on the other
side in IB. The network can always drop the ack, and the receiver
cannot tell.

I disagree that RC has anything to do with this causal ordering issue,
this is not a property described in the IB spec for RC.

>One of the hardware designs I worked with over a decade ago
>actually had *two* completion queues, one from the left brain (receive)
>and the other from the right brain (transmit). Nevertheless we
>presented

All verbs hardware has to have these two flows.

>a unified completion queue that followed ordering rules. We were not
>"reflecting" the hardware, we were implementing an interface in a
>combination of hardware and software.

And how did you detect lost acks at the reciever to make this an
actual guarentee? Maybe TCP can do this but IB cannot.

I honestly don't know why you think verbs has this requirement. It is
not in the IB spec, and can't be implemented by IB hardware. If
one of the other flavours of verbs has this, then fine, but it is not
part of the *Linux* verbs flavour.

> I cannot think of a reason why that would not work here. Simply do
> not report a completion to the user before guaranteeing that the
> acks will be transmitted.

I've already explained three times why this is not enough for IB.

>What would you do if the next user action had been to close the
>connection?  Would the acks have been lost?

No acks will be lost. The IB termination process requires the closing
side to drain it's sendq and stop injecting new traffic before sending
the CM close message. The other must also drain the sendq before
replying. This handshake ensures the QP is quiescent and all in flight
traffic completed before the QP is destroyed.

IB basically shifts the lingering close into the CM arena, where it
costs less, and allows the QP and associated HW resources to be
destroyed quickly.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V1 5/9] IB/core: Add rdma_network_type to wc

2015-11-24 Thread Jason Gunthorpe
On Tue, Nov 24, 2015 at 03:47:51PM +0200, Matan Barak wrote:
> > It isn't just the hop limit that has to come from the route entry, all
> > the source information of the path comes from there. Ie the gid table
> > should accept the route entry directly and spit out the sgid_index.
> >
> > The responder side is the same, it also needs to do a route lookup to
> > figure out what it is doing, and that may not match what the rx says
> > from the headers. This is important stuff.
> >
> 
> The only entity that translates between IPs and GIDs is the RDMACM.

The rocev2 stuff is using IP, and the gid entry is now overloaded to
specify IP header fields.

Absolutely every determination of IP header fields needs to go through
the route table, so every single lookup that can return a rocev2 SGID
*MUST* use route data.

The places in this series where that isn't done are plain and simply
wrong.

The abstraction at the gid cache is making it too easy to make this
mistake. It is enabling callers to do direct gid lookups without a
route lookup, which is unconditionally wrong. Every call site into the
gid cache I looked at appears to have this problem.

The simplest fix is to have a new gid cache api for rocve2 that
somehow forces/includes the necessary route lookup. The existing API
cannot simply be extended for rocev2.

> roce_gid_mgmt, is the part that populates this "dumb" database.
> IMHO, adding such a "smart" layer to the GID cache is wrong, as this
> should be part of RDMACM which does the translation. No need to get
> the gid cache involved.

OK. Change the gid cache so only a RDMA CM private API can return
rocev2 gids.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V1 9/9] IB/cma: Join and leave multicast groups with IGMP

2015-11-24 Thread Jason Gunthorpe
On Tue, Nov 24, 2015 at 11:41:10AM +0200, Moni Shoua wrote:
> On Mon, Nov 23, 2015 at 11:25 PM, Jason Gunthorpe
> <jguntho...@obsidianresearch.com> wrote:
> > On Thu, Oct 15, 2015 at 07:07:12PM +0300, Matan Barak wrote:
> >> diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h
> >> index 0a40ed2..5bea0e8 100644
> >> +++ b/include/rdma/ib_sa.h
> >> @@ -206,6 +206,9 @@ struct ib_sa_mcmember_rec {
> >>   u8   scope;
> >>   u8   join_state;
> >>   int  proxy_join;
> >> + int  ifindex;
> >> + struct net  *net;
> >> + enum ib_gid_type gid_type;
> >>  };
> >
> > This is really gross.
> >
> > Make ib_init_ah_from_mcmember accept a QP and get the above stuff from
> > the QP.
> >
> > Jason
> 
> Which QP is that. You might not have any existing QP when you want to
> create the AH or you might have 10.

roce multicast is only done with the CM and the CM always has a QP.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next 03/10] IB/iser: Don't register memory for all immediatedata writes

2015-11-24 Thread Jason Gunthorpe
On Tue, Nov 24, 2015 at 12:04:36PM +0200, Sagi Grimberg wrote:
> Jason and Or,
> 
> >
> >I'm with Or on this, this is really goofy looking.
> >
> >This routine probably should not be setting the rkey at all, it makes
> >no sense to have a routine that returns a lkey and a rkey. Those are
> >always different flows.
> >
> >Once that is fixed then the above if can be hoisted to the actual
> >calling code that needs an rkey, at the point where it does something
> >different when !iser_always_reg is true..
> 
> This change is acceptable to me. However, given that it includes
> rework outside the scope of this specific patch which is a performance
> regression fix, would it be acceptable to have it incremental and
> provide a FIXME comment here?

I'm not bothered by the ordering as long as we are clear where things
are going :)

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V1 3/9] IB/core: Add gid attributes to sysfs

2015-11-24 Thread Jason Gunthorpe
On Tue, Nov 24, 2015 at 03:49:28PM +0200, Matan Barak wrote:
> Of course that if there is no such documentation, I can add a new file
> for the sysfs ABI defined in this patch.

That is probably needed, our old stuff predates this new process.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method

2015-11-24 Thread Jason Gunthorpe
On Tue, Nov 24, 2015 at 09:39:33AM -0500, Chuck Lever wrote:
> >> Do we really want to go down that road?  It seems like we've decided
> >> in general that while the protocol specs say MR must be unmapped before
> >> proceeding with the data that is painful enough to ignore this
> >> requirement.  E.g. iser for example only does the local invalidate
> >> just before reusing the MR.
> 
> That leaves the MR exposed to the remote indefinitely. If
> the MR is registered for remote write, that seems hazardous.

Agree.

If we have a performance knob it should be to post the invalidate,
then immediately dma unmap and start using the buffer, not to defer
the invalidate potentially indefinitely.  Then it is just a race to be
exploited not a long lived window into random page cache memory.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/ipoib: CSUM support in connected mode

2015-11-24 Thread Jason Gunthorpe
On Tue, Nov 24, 2015 at 10:45:14AM +0100, Yann Droneaud wrote:

> > Same as my question above, if peer supports this feature do you see
> > anything wrong?
> 
> If peer is going to forward this packet to a different network, which
> is not IPoIB based, the checksum will be checked and the packet will be
> rejected.

Exactly, this is why this approach has never been acceptable for
mainline. Arrange things so ipoib can use skb_partial_csum_set.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/11] IB: only keep a single key in struct ib_mr

2015-11-23 Thread Jason Gunthorpe
On Sun, Nov 22, 2015 at 06:46:48PM +0100, Christoph Hellwig wrote:
> While IB supports the notion of returning separate local and remote keys
> from a memory registration, the iWarp spec doesn't and neither does any
> of our in-tree HCA drivers [1] nor consumers.  Consolidate the in-kernel
> API to provide only a single key and make everyones life easier.
> 
> [1] the EHCA driver which is in the staging tree on it's way out can
> actually return two values from it's thick firmware interface.
> I doubt they ever were different, though.

I like this too, but, I'm a little worried this makes the API more
confusing - ideally, we'd get rid of all the IB_ACCESS stuff from
within the kernel completely. Then it make sense, the mr is created as
a rkey/lkey mr, the single return is only one kind, and that kind must
go in the lkey/rkey spots in other API places.

Anyhow, for the core stuff:

Reviewed-by: Jason Gunthorpe <jguntho...@obsidianresearch.com>

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next 03/10] IB/iser: Don't register memory for all immediatedata writes

2015-11-23 Thread Jason Gunthorpe
On Tue, Nov 17, 2015 at 11:41:39AM +0200, Sagi Grimberg wrote:
> 
> >On 11/16/2015 6:37 PM, Sagi Grimberg wrote:
> >>+++ b/drivers/infiniband/ulp/iser/iser_memory.c
> >>@@ -250,7 +250,7 @@ iser_reg_dma(struct iser_device *device, struct
> >>iser_data_buf *mem,
> >>  struct scatterlist *sg = mem->sg;
> >>  reg->sge.lkey = device->pd->local_dma_lkey;
> >>-reg->rkey = device->mr->rkey;
> >>+reg->rkey = device->mr ? device->mr->rkey : 0;
> >>  reg->sge.addr = ib_sg_dma_address(device->ib_device, [0]);
> >>  reg->sge.length = ib_sg_dma_len(device->ib_device, [0]);
> >
> >what's the role of this hunk? why it belongs here? you are testing
> >device->mr but this is
> >something global and has nothing to do specially with specific IOs for
> >which this patch
> >aims to act
> 
> It's because device->mr might not be allocated at all if
> always_register=Y, however in this case for all-immediatedata writes
> I don't need memory registration and I can use pd->local_dma_lkey.

I'm with Or on this, this is really goofy looking.

This routine probably should not be setting the rkey at all, it makes
no sense to have a routine that returns a lkey and a rkey. Those are
always different flows.

Once that is fixed then the above if can be hoisted to the actual
calling code that needs an rkey, at the point where it does something
different when !iser_always_reg is true..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/ipoib: CSUM support in connected mode

2015-11-23 Thread Jason Gunthorpe
On Wed, Nov 18, 2015 at 10:27:41PM +0200, Yuval Shaia wrote:
> > You need private-data exchange to negotiate the feature.
> > 
> > The feature should be a per-packet csum status header.
> > 
> > When sending a skb that is already fully csumed the receiver sets
> > CHECKSUM_UNNECESSARY.
> > 
> > When sending a skb that has CHECKSUM_PARTIAL then the
> > receiver needs to call skb_partial_csum_set.
> > 
> > Look at how something like VIRTIO_NET_HDR_F_NEEDS_CSUM works and copy
> > that scheme.

> Correct me if i'm wrong here but isn't this protocol assume both parties
> are aware of this special header?

Yes.

No matter what you do both sides must be aware of the change in
protocol, doing it correctly requires adding a small wire header to
flow through the checksum state. This would be enabled once
negotiation confirms both sides will support this.

> My case is a bit different, driver must support backward
> computability in a way that peer maybe a driver that do not support
> this feature and expect packet to be full checksummed.

This is why negotiation is mandatory.

> > DO NOT EVER set CHECKSUM_UNNECESSARY on packets that do not have valid
> > csums - that breaks the net stack.
> The entire idea here is to fake csum offload so how would i tell the stack
> not to run csum on incoming packet?

I already explained this, call skb_partial_csum_set and the stack will
avoid csum work.

> > Yes, you need to add a header to all packets to support this scheme,
> > that is what the private-data negotiation is for.
> > 
> > While you are at it, I'd make room for something like
> > VIRTIO_NET_HDR_GSO_* in the RC protocol too. Implementing GSO
> > forwarding is probably another big performance win.
> If i understood you correctly and you mean exchange of
> "driver-capabilities", then yes, it is there with the extends of
> ipoib_cm_data structure.

You need to dig into how the things I referenced above work, fully
understand them and then adapt them to IPoIB.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/9] IB: add a proper completion queue abstraction

2015-11-23 Thread Jason Gunthorpe
On Sat, Nov 14, 2015 at 08:08:49AM +0100, Christoph Hellwig wrote:
> On Fri, Nov 13, 2015 at 11:25:13AM -0700, Jason Gunthorpe wrote:
> > For instance, like this, not fulling draining the cq and then doing:
> > 
> > > + completed = __ib_process_cq(cq, budget);
> > > + if (completed < budget) {
> > > + irq_poll_complete(>iop);
> > > + if (ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0) {
> > 
> > Doesn't seem entirely right? There is no point in calling
> > ib_req_notify_cq if the code knows there is still stuff in the CQ and
> > has already, independently, arranged for ib_poll_hander to be
> > guarenteed called.
> 
> The code only calls ib_req_notify_cq if it knowns we finished earlier than
> our budget.

Okay, having now read the whole thing, I think I see the flow now. I don't
see any holes in the above, other than it is doing a bit more work
than it needs in some edges cases because it doesn't know if the CQ is
actually empty or not.

> > > + completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);
> > > + if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
> > > + ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
> > > + queue_work(ib_comp_wq, >work);
> > 
> > Same comment here..
> 
> Same here - we only requeue the work item if either we processed all of
> our budget, or ib_req_notify_cq with IB_CQ_REPORT_MISSED_EVENTS told
> us that we need to poll again.

I find the if construction hard to read, but yes, it looks OK.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/9] IB: add a proper completion queue abstraction

2015-11-23 Thread Jason Gunthorpe
On Mon, Nov 23, 2015 at 01:54:05PM -0800, Bart Van Assche wrote:

> Not really ... Please have a look at the SRP initiator source code. What the
> SRP initiator does is to poll the send queue before sending a new
> SCSI

I see that. What I don't see is how SRP handles things when the
sendq fills up, ie the case where __srp_get_tx_iu() == NULL. It looks
like the driver starts to panic and generates printks. I can't tell if
it can survive that, but it doesn't look very good..

It would be a lot better if this wasn't allowed to happen, the polling
loop can run until a sendq becomes available, and never return null
from __srp_get_tx_iu().

Ie, __srp_get_tx_iu should look more like

   ib_poll_cq_until(..., !list_empty(>free_tx));

Which would be a fairly sane core API for this direct usage.. Ideally
the core code would sleep if possible and not just spin. Maybe also
protect it with a timer.

> command to the target system starts. I think this approach could also be
> used in other ULP drivers if the send queue poll frequency is such that no
> send queue overflow occurs.

Yes, I agree, but it has to be done properly :)

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/9] IB: add a proper completion queue abstraction

2015-11-23 Thread Jason Gunthorpe
On Mon, Nov 23, 2015 at 02:33:05PM -0800, Bart Van Assche wrote:
> On 11/23/2015 02:18 PM, Jason Gunthorpe wrote:
> >On Mon, Nov 23, 2015 at 01:54:05PM -0800, Bart Van Assche wrote:
> >What I don't see is how SRP handles things when the
> >sendq fills up, ie the case where __srp_get_tx_iu() == NULL. It looks
> >like the driver starts to panic and generates printks. I can't tell if
> >it can survive that, but it doesn't look very good..
> 
> Hello Jason,
> 
> From srp_cm_rep_handler():
> 
>   target->scsi_host->can_queue
>   = min(ch->req_lim - SRP_TSK_MGMT_SQ_SIZE,
> target->scsi_host->can_queue);
> 
> In other words, the SCSI core is told to ensure that the number of
> outstanding SCSI commands is one less than the number of elements in the
> ch->free_tx list. And since the SRP initiator serializes task management
> requests it is guaranteed that __srp_get_tx_iu() won't fail due to
> ch->free_tx being empty.

I realize that, and as I already explained, SRP cannot drive the sendq
flow control from the recv side.

The SCSI core considers the command complete and will issue a new
command as soon as the recv completion associated with the command is
returned. (ie when the remote responds)

This *DOES NOT* say anything about the state of the sendq, it does not
guarantee there is send CQ entry available for the associated send, it
does not guarantee there is available space in the sendq. Verbs DOES
NOT make ordering guarentees between queues, even if the queues are
causally related.

This is an important point in verbs and it is commonly done wrong.

So, yes, __srp_get_tx_iu absolutely can fail due to ch->free_tx being
empty, even though by observing the recv side SRP has inferred that
the sendq should have space.

Every ULP has to cope with this, and a direct poll API that doesn't
account for the need to block on a predicate is broken by design.
This is why I object.

'ib_poll_cq_until' would correct all of this.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/9] IB: add a proper completion queue abstraction

2015-11-23 Thread Jason Gunthorpe
On Mon, Nov 23, 2015 at 06:35:28PM -0800, Caitlin Bestler wrote:

> Is it possible for an IB HCA to transmit a response on a QP and not
> in that packet or a previous packet acknowledge something that it
> has delivered to the user?

AFAIK, the rules of ack coalescing do not interact with the send
side. Even if they did, that is the wrong place to look at.

We must look at the receiver. Ordered ack,data on the wire may suffer
a packet loss and the ack may not reach the reciever. In this case can
the reciever detect the lost ack and not progress the data? For IB, it
cannot. The ack sequencing is part of the transmitters recv FSM, and
does not interact with the send FSM.

I feel this a deliberate IB design choice to be simple and efficient
in hardware.

> My recollection of the IB verbs is that they were unlikely to have
> overlooked something like that. If it did slip through then there
> should be an errata.

verbs reflects the wire protocol and the wire protocol has no way to
create a linkage between the send and recv sides of a RC QP. It is not
a spec bug, there is no errata.

> But regardless of specification lawyering, is this an implementation
> issue.

All IB implementations have no choice but to act this way - the wire
protocol provides no way to guarentee ack vs data sequencing at the
receiver, so there is simply no way to guarentee the sendq advances
strictly causally with the recvq.

> Are there actual HCAs that make this mistake?

All IB HCAs have this behavior and require apps to see a send CQ
completion before making any statements about the state of the send Q
or buffers handed over to the HCA. Tom and I have seen this in real
systems under proper stress conditions. [Which is why I am so certain
about this, because when I first hit it years ago I dug into the spec
and figured out it was not a HW bug I was looking at]

This is a direct consequence of how IB runs the ACK protocol.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR

2015-11-23 Thread Jason Gunthorpe
On Mon, Nov 23, 2015 at 10:52:26PM -0800, Christoph Hellwig wrote:
> 
> So at lest for 4.5 we're unlikely to be able to get rid of it alone
> due to the RDS issue.  We'll then need performance numbers for mlx4,
> and figure out how much we care about mthca.

mthca is unfortunately very popular in the used market, it is very
easy to get cards, build a cheap test cluster, etc. :(

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/9] IB: add a proper completion queue abstraction

2015-11-23 Thread Jason Gunthorpe
On Mon, Nov 23, 2015 at 07:34:53PM -0500, Tom Talpey wrote:

> Been there, seen that. Bluescreened on it, mysteriously.

Yes, me too :(

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/9] IB: add a proper completion queue abstraction

2015-11-23 Thread Jason Gunthorpe
On Mon, Nov 23, 2015 at 03:30:42PM -0800, Caitlin Bestler wrote:
>The receive completion can be safely assumed to indicate transmit
>completion over a reliable connection unless your peer has gone
>completely bonkers and is replying to a command that it did not
>receive.

Perhaps iWarp is different and does specify this ordering but IB does
not.

The issue with IB is how the ACK protocol is designed. There is not
strong ordering between ACKs and data transfers. A HCA can send
ACK,DATA and the network could drop the ACK. The recevier side does
not know the ACK was lost and goes ahead to process DATA.

Since only ACK advances the sendq and DATA advances the recvq it is
trivial to get a case where the recvq is advanced with a reply while
the sendq continues to wait for the ACK to be resent.

Further IB allows ACK coalescing and has no rules for how an ACK is
placed. It is entirely valid for a HCA to RECV,REPLY,ACK - for
instance.

>I actually had a bug in an early iWARP emulation where the simulated
>peer, because it was simulated, responded
>instantly. The result was a TCP segment that both acked the
>transmission *and* contained the reply. The bug was
>that the code processed the reception before the transmission ack,
>causing the receive completion to be placed
>on the completion queue before transmit completion.

I don't know if iWARP has the same lax ordering as IB, but certainly,
what you describe is legal for IB verbs to do, and our kernel ULPs
have to cope with it.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method

2015-11-23 Thread Jason Gunthorpe
On Mon, Nov 23, 2015 at 10:45:56PM -0800, Christoph Hellwig wrote:
> On Mon, Nov 23, 2015 at 05:14:14PM -0500, Chuck Lever wrote:
> > In the current xprtrdma implementation, some memreg strategies
> > implement ro_unmap synchronously (the MR is knocked down before the
> > method returns) and some asynchonously (the MR will be knocked down
> > and returned to the pool in the background).
> > 
> > To guarantee the MR is truly invalid before the RPC consumer is
> > allowed to resume execution, we need an unmap method that is
> > always synchronous, invoked from the RPC/RDMA reply handler.
> > 
> > The new method unmaps all MRs for an RPC. The existing ro_unmap
> > method unmaps only one MR at a time.
> 
> Do we really want to go down that road?  It seems like we've decided
> in general that while the protocol specs say MR must be unmapped before
> proceeding with the data that is painful enough to ignore this

That is not my impression, I was thinking we keep finding that ULPs
are not implemented correctly. The various clean up exercises keep
exposing flaws.

The common code is intended to drive RDMA properly.

Async invalidating the rkey is fundamentally a security issue and
should be treated as such. The kernel never trades security for
performance without a user opt in. This is the same logic we've used
for purging the global writable rkey stuff, even though it often had
performance.

> requirement.  E.g. iser for example only does the local invalidate
> just before reusing the MR.

Ugh :(

> I'd like to hear arguments for and against each method instead of
> adding more magic to drivers to either optimize MR performance and
> add clunky workarounds to make it even slower, and instead handled
> the semantics we agreed upo in common code.

Common code should make it easy to do this right, an invalidate of the
MR ordered before the dma unmap, which must complete before the buffer
is handed back to the caller. With easy support for send with
invalidate.

If the common code has an opt-in to make some of these steps run
async, and that gives performance, then fine, but the default should be
secure operation.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/cma: Add a missing rcu_read_unlock()

2015-11-23 Thread Jason Gunthorpe
On Fri, Nov 20, 2015 at 11:04:12AM -0800, Bart Van Assche wrote:
> Ensure that validate_ipv4_net_dev() calls rcu_read_unlock() if
> fib_lookup() fails. Detected by sparse. Compile-tested only.
> 
> Fixes: "IB/cma: Validate routing of incoming requests" (commit f887f2ac87c2).
> Cc: Haggai Eran <hagg...@mellanox.com>
> Cc: stable <sta...@vger.kernel.org>
>  drivers/infiniband/core/cma.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)

Yep

Reviewed-by: Jason Gunthorpe <jguntho...@obsidianresearch.com>

You say sparse detected this? Mellanox/Or folks - I thought you ran all
the common static stuff on your patches?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 11/11] IB: provide better access flags for fast registrations

2015-11-23 Thread Jason Gunthorpe
On Sun, Nov 22, 2015 at 06:46:49PM +0100, Christoph Hellwig wrote:
> Instead of the confusing IB spec values provide a flags argument that
> describes:
> 
>   a) the operation we perform the memory registration for, and
>   b) if we want to access it for read or write purposes.
> 
> This helps to abstract out the IB vs iWarp differences as well.

This is so much better, thanks

> +#define IB_REG_LKEY  (ib_reg_scope_t)0x
> +#define IB_REG_RKEY  (ib_reg_scope_t)0x0001

Wrap in () just for convention?

> +static inline int ib_scope_to_access(ib_reg_scope_t scope)
> +{
> + unsigned int acc = 0;
> +
> + if (scope & IB_REG_RKEY) {
> + WARN_ON(scope & IB_REG_OP_SEND);
> +
> + if (scope & IB_REG_OP_RDMA_READ)
> + acc |= IB_ACCESS_REMOTE_READ;
> + if (scope & IB_REG_OP_RDMA_WRITE)
> + acc |= IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE;
> + } else {
> + if (scope & IB_REG_OP_RDMA_READ)
> + acc |= IB_ACCESS_LOCAL_WRITE;

> + }
> +
> + return acc;
> +}
> +
> +static inline int iwarp_scope_to_access(ib_reg_scope_t scope)
> +{

Maybe

unsigned int acc = ib_scope_to_access(scope);
if ((scope & (IB_REG_RKEY | IB_REG_OP_RDMA_READ)) == (IB_REG_RKEY | 
IB_REG_OP_RDMA_READ))
   acc |= IB_ACCESS_LOCAL_WRITE;

return acc;

Makes it a bit clearer what the only difference is.

Is this enough to purge the cap test related to this?

Jaason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: garbage collect old memory registration code

2015-11-23 Thread Jason Gunthorpe
On Sun, Nov 15, 2015 at 07:05:53PM +0100, Christoph Hellwig wrote:
> This series removes huge chunks of code related to old memory
> registration methods that we don't use anymore.
> 
> This expects my "IB: merge struct ib_device_attr into struct ib_device"
> patch to be already applied.
> 
> Also available as a git tree:
> 
>   
> http://git.infradead.org/users/hch/rdma.git/shortlog/refs/heads/rdma-mr-cleanups
>   git://git.infradead.org/users/hch/rdma.git rdma-mr-cleanups

Only looked carefully at the core changes here, but they look good to
me. Thanks for doing this.

Reviewed-by: Jason Gunthorpe <jguntho...@obsidianresearch.com>
(for core bits)

Driver changes look pretty much trivial too me.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V1 5/9] IB/core: Add rdma_network_type to wc

2015-11-23 Thread Jason Gunthorpe
> + /* Use the hint from IP Stack to select GID Type */
> + network_gid_type = ib_network_to_gid_type(addr->dev_addr.network);
> + if (addr->dev_addr.network != RDMA_NETWORK_IB) {
> + route->path_rec->gid_type = network_gid_type;
> + /* TODO: get the hoplimit from the inet/inet6 device */
> + route->path_rec->hop_limit = IPV6_DEFAULT_HOPLIMIT;

Uh, that is more than a TODO, that is showing this is all messed up.

It isn't just the hop limit that has to come from the route entry, all
the source information of the path comes from there. Ie the gid table
should accept the route entry directly and spit out the sgid_index.

The responder side is the same, it also needs to do a route lookup to
figure out what it is doing, and that may not match what the rx says
from the headers. This is important stuff.

I really don't like the API changes that went in with the last series
that added net_dev and gid_attr everywhere, that just seems to be
enabling mistakes like the above. You can't use rocev2 without doing
route lookups, providing APIs that don't force this to happen just
encourages broken flows like this.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V1 3/9] IB/core: Add gid attributes to sysfs

2015-11-23 Thread Jason Gunthorpe
On Thu, Oct 15, 2015 at 07:07:06PM +0300, Matan Barak wrote:
> This patch set adds attributes of net device and gid type to each GID
> in the GID table. Users that use verbs directly need to specify
> the GID index. Since the same GID could have different types or
> associated net devices, users should have the ability to query the
> associated GID attributes. Adding these attributes to sysfs.

There is no addition to Documentation/ABI/ for this change

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V1 7/9] IB/cma: Add configfs for rdma_cm

2015-11-23 Thread Jason Gunthorpe
On Thu, Oct 15, 2015 at 07:07:10PM +0300, Matan Barak wrote:
> Users would like to control the behaviour of rdma_cm.
> For example, old applications which don't set the
> required RoCE gid type could be executed on RoCE V2
> network types. In order to support this configuration,
> we implement a configfs for rdma_cm.
> 
> In order to use the configfs, one needs to mount it and
> mkdir  inside rdma_cm directory.
> 
> The patch adds support for a single configuration file,
> default_roce_mode. The mode can either be "IB/RoCE v1" or
> "RoCE v2".

Also no Documentation/ABI stuff for this

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V1 9/9] IB/cma: Join and leave multicast groups with IGMP

2015-11-23 Thread Jason Gunthorpe
On Thu, Oct 15, 2015 at 07:07:12PM +0300, Matan Barak wrote:
> diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h
> index 0a40ed2..5bea0e8 100644
> +++ b/include/rdma/ib_sa.h
> @@ -206,6 +206,9 @@ struct ib_sa_mcmember_rec {
>   u8   scope;
>   u8   join_state;
>   int  proxy_join;
> + int  ifindex;
> + struct net  *net;
> + enum ib_gid_type gid_type;
>  };

This is really gross.

Make ib_init_ah_from_mcmember accept a QP and get the above stuff from
the QP.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/9] IB: add a proper completion queue abstraction

2015-11-23 Thread Jason Gunthorpe
On Mon, Nov 23, 2015 at 01:04:25PM -0800, Bart Van Assche wrote:

> Considerable time ago the send queue in the SRP initiator driver was
> modified from signaled to non-signaled to reduce the number of interrupts
> triggered by the SRP initiator driver. The SRP initiator driver polls the
> send queue every time before a SCSI command is sent to the target. I think
> this is a pattern that is also useful for other ULP's so I'm not convinced
> that ib_process_cq_direct() should be deprecated :-)

As I explained, that is a fine idea, but I can't see how SRP is able
to correctly do sendq flow control without spinning on the poll, which
it does not do.

I'm guessing SRP is trying to drive sendq flow control from the recv
side, like NFS was. This is wrong and should not be part of the common
API.

Does that make sense?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/9] IB: add a proper completion queue abstraction

2015-11-13 Thread Jason Gunthorpe
On Fri, Nov 13, 2015 at 02:46:43PM +0100, Christoph Hellwig wrote:
> This adds an abstraction that allows ULP to simply pass a completion
> object and completion callback with each submitted WR and let the RDMA
> core handle the nitty gritty details of how to handle completion
> interrupts and poll the CQ.

This looks pretty nice, I'd really like to look it over carefully
after SC|15..

I know Bart and others have attempted to have switching between event
and polling driven operation, but there were problems resolving the
races. Would be nice to review that conversation.. Do you remember the
details Bart?

> +static int __ib_process_cq(struct ib_cq *cq, int budget)
> +{
> + int i, n, completed = 0;
> +
> + while ((n = ib_poll_cq(cq, IB_POLL_BATCH, cq->wc)) > 0) {
> + completed += n;
> + if (completed >= budget)
> + break;

For instance, like this, not fulling draining the cq and then doing:

> + completed = __ib_process_cq(cq, budget);
> + if (completed < budget) {
> + irq_poll_complete(>iop);
> + if (ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0) {

Doesn't seem entirely right? There is no point in calling
ib_req_notify_cq if the code knows there is still stuff in the CQ and
has already, independently, arranged for ib_poll_hander to be
guarenteed called.

> + if (!irq_poll_sched_prep(>iop))
> + irq_poll_sched(>iop);

Which, it seems, is what this is doing.

Assuming irq_poll_sched is safe to call from a hard irq context, this
looks sane, at first glance.

> + completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);
> + if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
> + ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
> + queue_work(ib_comp_wq, >work);

Same comment here..

> +static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
> +{
> + queue_work(ib_comp_wq, >work);

> + switch (cq->poll_ctx) {
> + case IB_POLL_DIRECT:
> + cq->comp_handler = ib_cq_completion_direct;
> + break;
> + case IB_POLL_SOFTIRQ:
> + cq->comp_handler = ib_cq_completion_softirq;
> +
> + irq_poll_init(>iop, IB_POLL_BUDGET_IRQ, ib_poll_handler);
> + irq_poll_enable(>iop);
> + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
> + break;

I understand several drivers are not using a hard irq context for the
comp_handler call back. Is there any way to exploit that in this new
API so we don't have to do so many context switches? Ie if the driver
already is using a softirq when calling comp_handler can we somehow
just rig ib_poll_handler directly and avoid the overhead? (Future)

At first glance this seems so much saner than what we have..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/mad: In validate_mad, validate CM method and attribute

2015-11-12 Thread Jason Gunthorpe
On Thu, Nov 12, 2015 at 08:30:55PM +, Hefty, Sean wrote:
> > > + /* CM attributes other than ClassPortInfo only use Send method
> > */
> > > + if (mad_hdr->mgmt_class == IB_MGMT_CLASS_CM) {
> > > + if (mad_hdr->attr_id != IB_MGMT_CLASSPORTINFO_ATTR_ID) {
> > > + if (mad_hdr->method != IB_MGMT_METHOD_SEND)
> > > + goto out;
> > > + } else if (mad_hdr->method != IB_MGMT_METHOD_GET_RESP)
> > > + goto out;
> > > + }
> > 
> > Doesn't this invalidate a CM Get(ClassPortInfo) mad?
> 
> I believe this does.  I think you could remove the else if clause
> and let the received MAD get passed to the CM.  It would be dropped
> there as unsupported.  The net result is likely the same.

IIRC responding to Get(CPI) is mandatory?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fwd: [PATCH] svcrdma: Do not send XDR roundup bytes for a write chunk

2015-11-12 Thread Jason Gunthorpe
On Thu, Nov 12, 2015 at 09:47:13AM -0500, Chuck Lever wrote:
> I wish git send-mail had an address book. I’m really tired
> of misspelling the to/cc addresses on patches.

It does:

CONFIGURATION
   sendemail.aliasesfile
   To avoid typing long email addresses, point this to one or more 
email aliases files. You must also supply
   sendemail.aliasfiletype.

   sendemail.aliasfiletype
   Format of the file(s) specified in sendemail.aliasesfile. Must be 
one of mutt, mailrc, pine, elm, or gnus.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 2/3] svcrdma: Use device rdma_read_access_flags

2015-11-11 Thread Jason Gunthorpe
On Wed, Nov 11, 2015 at 11:25:06AM +0200, Sagi Grimberg wrote:
> For RDMA READs, a HCA will perform the memory scatters when on the RX
> path, when receiving the read responses containing the data. This means
> that the HCA needs to perform a lookup of the relevant scatter entries
> upon each read response. Due to that, modern HCAs keep a dedicate cache
> for this type of RX-path lookup (which is limited in size naturally).

There is always a memory scatter, and MR setup overheads and table
reads are not zero. There is nothing architectural in the verbs or IBA
that would require a HCA to run slower for SGL than MR...

As you say, another choice the ULP should not be making.. Even
selecting the optimal SGL list len is not something the ULP should
do, in that case.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 2/3] svcrdma: Use device rdma_read_access_flags

2015-11-11 Thread Jason Gunthorpe
On Wed, Nov 11, 2015 at 12:02:25AM -0800, Christoph Hellwig wrote:
> > No need for the else, !rdma_cap_needs_rdma_read_mr means pd->local_dma_lkey 
> > is okay
> > to use.
> 
> What would happen if someone tried to use NFS on usnic without this?

Honestly, I have no idea how usnic fits into the kernel side, it
simply doesn't provide the kapi. It should fail much earlier, like
when creating a PD, IMHO.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/3] IB/core: Expose a device attribute for rdma_read access flags

2015-11-10 Thread Jason Gunthorpe
On Tue, Nov 10, 2015 at 12:51:10PM +0100, Yann Droneaud wrote:
> Why were those hw providers not modified to
> enforce IB_ACCESS_REMOTE_WRITE when needed, instead of asking users to
> set it for them ?

iWarp hardware simply cannot do it it all.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH ib-next 2/3] IB/core: IB/core: Allow legacy verbs through extended interfaces

2015-11-10 Thread Jason Gunthorpe
On Tue, Nov 10, 2015 at 12:25:41PM +0200, Matan Barak wrote:
> The kernel will do the above fallback if the command is a legacy
> command wrapped in an extended command (i.e - no extra flags).
> If an application uses one of the extended values, and fall back to
> legacy verb on ENOSYS, it'll behave differently after this change:
> Re-posting this example:

Again, that doesn't seem like a likely case today, the extended verbs
added so far don't strike me as optional, and even so, if someone is
coding that kind of fall back it is incorrect to just look for ENOSYS,
trying to turn on an optional feature could fail for many different
reasons.

If someone is actually doing this, then we can work around it, but
the kernel seems to be a bit more fluid on error returns - probably
because people get them so wrong so often :(

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/3] IB/core: Expose a device attribute for rdma_read access flags

2015-11-10 Thread Jason Gunthorpe
On Tue, Nov 10, 2015 at 12:44:13PM +0200, Sagi Grimberg wrote:
> Different devices may require different access permissions
> for rdma reads e.g. for Infiniband devices, local write access
> suffices while iWARP devices require remote write permissions as
> well.
> 
> This situation generates extra logic for ULPs that need to be aware
> of the underlying device to determine rdma read access. Instead,
> add a device attribute for ULPs to use without being aware of the
> underlying device.
> 
> Also, set rdma_read_access_flags in the relevant device drivers:
> mlx4, mlx5, qib, ocrdma, nes: IB_ACCESS_LOCAL_WRITE
> cxgb3, cxgb4, hfi: IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE

IB devices *should* be using a local_dma_lkey instead of a memory
registration for the RDMA READ target memory.

iWarp devices *must* use a memory registration instead.

Having an API that encourages ULPs to do useless MR work on IB does
not seem like a great idea, but if the ULP needs to create a MR anyhow
(SG limits or whatever), then it makes some sense.. But not in absence
of the 'need a mr' general check..

Finally, please don't add rdma_read_access_flags - we went to a lot of
trouble to add the cap stuff, and this stuff is already covered by it.
No need to change every driver.

I'd suggest something like

  unsigned int rdma_cap_rdma_read_mr_flags(const struct ib_pd *pd)
  {
 if (rdma_protocol_iwarp(pd->device, rdma_start_port(pd->device)))
   return IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
 return IB_ACCESS_LOCAL_WRITE;
  }

  bool rdma_cap_need_rdma_read_mr(const struct ib_pd *pd) ...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB: start documenting device capabilities

2015-11-10 Thread Jason Gunthorpe
On Tue, Nov 10, 2015 at 01:19:43PM +0100, Christoph Hellwig wrote:
> Just IB_DEVICE_LOCAL_DMA_LKEY and IB_DEVICE_MEM_MGT_EXTENSIONS for now
> as I'm most familar with those.
> 
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Looks right to me

Reviewed-By: Jason Gunthorpe <jguntho...@obsidianresearch.com>

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 ib-next 2/3] IB/core: IB/core: Allow legacy verbs through extended interfaces

2015-11-10 Thread Jason Gunthorpe
On Tue, Nov 10, 2015 at 08:00:09PM +0200, Eli Cohen wrote:
> When an extended verbs is an extension to a legacy verb, the original
> functionality is preserved. Hence we do not require each hardware driver
> to set the extended capability. This will allow to use the extended verb
> in its simple form with drivers that do not support the extended
> capability.

Can we please get rid of uverbs_cmd_mask and uverbs_ex_cmd_mask ?

The driver should set the function pointers it does not support to
NULL. We don't need the bitmask because we already know the answer.

For performance, I recommend having the core replace the NULL op
pointers from the driver with a stub that returns ENOTSUP, and then
drop all the checking in the fast path.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/3] IB/core: Expose a device attribute for rdma_read access flags

2015-11-10 Thread Jason Gunthorpe
On Tue, Nov 10, 2015 at 05:41:47AM -0800, Christoph Hellwig wrote:
> FYI, this is the API I'd aim for (only SRP and no HW driver converted
> yet):

>   n = ib_map_mr_sg(desc->mr, state->sg, state->sg_nents,
> -  dev->mr_page_size);
> +  dev->mr_page_size,
> +  /*
> +   * XXX: add a bool write argument to this function,
> +   * so that we only need to open up the required
> +   * permissions.
> +   */
> +  IB_MR_REMOTE | IB_MR_RDMA_READ |
> IB_MR_RDMA_WRITE);

I would call it IB_RDMA_LKEY and IB_RDMA_RKEY. We have other places in
the API where lkey/rkey is used and it makes a lot more sense to think
about a MR as being either a lkey or rkey usable MR - since this is
effectively what we are doing here with these ops.

IB_MR_RKEY | IB_MR_RDMA_READ == The resulting key can be used
in a wr.rdma.rkey field

IB_MR_LKEY | IB_MR_SEND == The key can be used in wr.sg_list[].lkey

etc

It is an error to use a IB_MR_LKEY in a rkey field, etc..

> +enum ib_mr_flags {
> + /* scope: either remote or local */
> + IB_MR_REMOTE,
> + IB_MR_LOCAL,
> +
> + /* direction: one or both can be ORed into the scope above */
> + IB_MR_RDMA_READ = (1 << 10),
> + IB_MR_RDMA_WRITE= (1 << 11)

Don't forget SEND too.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 2/3] svcrdma: Use device rdma_read_access_flags

2015-11-10 Thread Jason Gunthorpe
On Tue, Nov 10, 2015 at 04:04:32AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 10, 2015 at 01:46:40PM +0200, Sagi Grimberg wrote:
> > 
> > 
> > On 10/11/2015 13:41, Christoph Hellwig wrote:
> > >Oh, and while we're at it.  Can someone explain why we're even
> > >using rdma_read_chunk_frmr for IB?  It seems to work around the
> > >fact tat iWarp only allow a single RDMA READ SGE, but it's used
> > >whenever the device has IB_DEVICE_MEM_MGT_EXTENSIONS, which seems
> > >wrong.
> > 
> > I think Steve can answer it better than I can. I think that it is
> > just to have a single code path for both IB and iWARP. I agree that
> > the condition seems wrong and for small transfers rdma_read_chunk_frmr
> > is really a performance loss.
> 
> Well, the code path already exists, but only is used fi
> IB_DEVICE_MEM_MGT_EXTENSIONS isn't set.  Below is an untested patch
> that demonstrates how I think svcrdma should setup the reads.  Note
> that this also allows to entirely remove it's allphys MR.
> 
> Note that as a followon this would also allow to remove the
> non-READ_W_INV code path from rdma_read_chunk_frmr as a future
> step.

I like this, my only comment is we should have a rdma_cap for this
behavior, rdma_cap_needs_rdma_read_mr(pd) or something?

> + if (rdma_protocol_iwarp(dev, newxprt->sc_cm_id->port_num)) {

Use here

> + /*
> +  * iWARP requires remote write access for the data sink, and
> +  * only supports a single SGE for RDMA_READ requests, so we'll
> +  * have to use a memory registration for each RDMA_READ.
> +  */
> + if (!(dev->device_cap_flags &
> IB_DEVICE_MEM_MGT_EXTENSIONS)) {

Lets enforce this in the core, if rdma_cap_needs_rdma_read_mr is set
the the device must also set IB_DEVICE_MEM_MGT_EXTENSIONS, check at
device creation time.

> + } else if (rdma_ib_or_roce(dev, newxprt->sc_cm_id->port_num)) {
> + /*
> +  * For IB or RoCE life is easy, no unsafe write access is
> +  * required and multiple SGEs are supported, so we don't need
> +  * to use MRs.
> +  */
> + newxprt->sc_reader = rdma_read_chunk_lcl;
> + } else {
> + /*
> +  * Neither iWarp nor IB-ish, we're out of luck.
> +  */
>   goto errout;

No need for the else, !rdma_cap_needs_rdma_read_mr means pd->local_dma_lkey is 
okay
to use.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH ib-next 2/3] IB/core: IB/core: Allow legacy verbs through extended interfaces

2015-11-10 Thread Jason Gunthorpe
On Tue, Nov 10, 2015 at 05:40:26PM +0200, Eli Cohen wrote:
> On Tue, Nov 10, 2015 at 05:23:10PM +0200, Eli Cohen wrote:
> > > 
> > > Call it ENOTSUP then:
> > > 
> > >ENOTSUP Operation not supported (POSIX.1)
> > > 
> > > Same value on Linux.
> > 
> > Yes, that's better. I will resend.
> 
> 
> Well it seems like ENOTSUP is defined only here:
> 
> ./arch/parisc/include/uapi/asm/errno.h:115:#define ENOTSUP 252 /* 
> Function not implemented (POSIX.4 / HPUX) */

> Which obviously I cannot use. Jason, do you have another idea?

I would stick with EOPNOTSUPP in the kernel and userspace can view
that as ENOTSUP. My remark was more along the lines that EOPNOTSUPP is
not socket specific because it encompasses ENOTSUP as well on Linux,
due to having the same value.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 ib-next 2/3] IB/core: IB/core: Allow legacy verbs through extended interfaces

2015-11-10 Thread Jason Gunthorpe
On Tue, Nov 10, 2015 at 09:57:13PM +0200, Eli Cohen wrote:

> Yes, we can do this but I think this should be the subject for another
> patch, agree?

Sure

> Regarding using stabs, it may be nice but I don't think performance is
> the issue here. Most verbs implementations involve relatively long i/o
> operations anyway so the gain will not be noticable.

Intel keeps optimizing this common path since they use it for post..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH ib-next 2/3] IB/core: IB/core: Allow legacy verbs through extended interfaces

2015-11-10 Thread Jason Gunthorpe
On Tue, Nov 10, 2015 at 10:02:26PM +0200, Eli Cohen wrote:
> On Tue, Nov 10, 2015 at 11:23:35AM -0800, Bart Van Assche wrote:
> > 
> > How about using ENOTSUPP ?
> > 
> > $ PAGER= git grep 'define ENOTSUPP' include
> > include/linux/errno.h:#define ENOTSUPP 524 /* Operation is not supported */
> > 
> 
> Appearently this looks a better option but the following appears as a
> icomment above this group: /* Defined for the NFSv3 protocol */
> 
> I don't mind using this if we can all agree that this is the best
> choise.

We cannot use ENOTSUPP, it is not supported by userspace, there was a
list discussion about that a while back.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 2/3] svcrdma: Use device rdma_read_access_flags

2015-11-10 Thread Jason Gunthorpe
On Tue, Nov 10, 2015 at 04:00:43PM -0500, Tom Talpey wrote:

> Hmm, agreed, but it must still be acceptable to perform a registration
> instead of using the local_dma_lkey. As mentioned earlier, there are
> scatter limits and other implications for all-physical addressing that
> an upper layer may choose to avoid.

It is always acceptable to use a lkey MR instead of the local dma
lkey, but ULPs should prefer to use the local dma lkey if possible,
for performance reasons.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 2/3] svcrdma: Use device rdma_read_access_flags

2015-11-10 Thread Jason Gunthorpe
On Tue, Nov 10, 2015 at 06:01:01PM -0500, Chuck Lever wrote:
> The current mechanism of statically choosing either FRMR or
> local DMA depending on the device is probably not adequate.
> Maybe we could post all the Read WRs via a single chain? Or
> stick with FRMR when the amount of data to read is significant.

Yes, those are the right ways to go..

IMHO, the break point for when it makes sense to switch from a RDMA
READ chain to a MR is going to be a RDMA core tunable. The performance
curve won't have much to do with the ULP.

But certainly, if a requests fits in the SG list then it should just
use the local dma lkey directly, and consider allocating an
appropriate S/G list size when creating the QP.

The special thing about iwarp becomes simply defeating the use of the
local dma lkey for RDMA READ.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH ib-next 2/3] IB/core: IB/core: Allow legacy verbs through extended interfaces

2015-11-09 Thread Jason Gunthorpe
On Tue, Nov 10, 2015 at 01:24:31AM +0200, Eli Cohen wrote:
> > Also, when the driver tests the ex flags for support it should be
> > returning EOPNOTSUPP or such not EINVAL.. Return codes for the ex
> > stuff could stand a good sanity audit.
> > 
> 
> #define EOPNOTSUPP  95  /* Operation not supported on
> transport endpoint */
> 
> This does not seem like an ideal choise either. I think ENOSYS in this
> case is a better choise.

Call it ENOTSUP then:

   ENOTSUP Operation not supported (POSIX.1)

Same value on Linux.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next 1/4] IB/mlx5: Add create_cq extended command

2015-11-09 Thread Jason Gunthorpe
On Mon, Nov 09, 2015 at 06:30:54PM +0200, Matan Barak wrote:
> @@ -1385,7 +1385,8 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
>   (1ull << IB_USER_VERBS_CMD_CREATE_XSRQ) |
>   (1ull << IB_USER_VERBS_CMD_OPEN_QP);
>   dev->ib_dev.uverbs_ex_cmd_mask =
> - (1ull << IB_USER_VERBS_EX_CMD_QUERY_DEVICE);
> + (1ull << IB_USER_VERBS_EX_CMD_QUERY_DEVICE) |
> + (1ull << IB_USER_VERBS_EX_CMD_CREATE_CQ);

Eli posted a series that gets rid of this stuff, can you please
coordinate?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH ib-next 2/3] IB/core: IB/core: Allow legacy verbs through extended interfaces

2015-11-09 Thread Jason Gunthorpe
On Mon, Nov 09, 2015 at 08:48:36AM +0200, Haggai Eran wrote:
> On 08/11/2015 17:04, Matan Barak wrote:
> >> @@ -704,6 +719,10 @@ static ssize_t ib_uverbs_write(struct file *filp, 
> >> const char __user *buf,
> >> > }
> >> >
> >> > command = hdr.command & IB_USER_VERBS_CMD_COMMAND_MASK;
> >> > +   if (verify_command_mask(ib_dev, command)) {
> >> > +   ret = -EINVAL;
> > Why did you replace ENOSYS with EINVAL?
> 
> I think ENOSYS is meant only to represent a system call number that
> isn't supported. There was a change in checkpatch that now warns about
> it [1]. I'm not sure the intention was to fix existing uses though.

Within verbs we should have two kinds of return - not supported
request, and supported request with invalid parameters.

Maybe use EOPNOTSUPP ?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH ib-next 2/3] IB/core: IB/core: Allow legacy verbs through extended interfaces

2015-11-09 Thread Jason Gunthorpe
On Tue, Nov 10, 2015 at 01:05:31AM +0200, Eli Cohen wrote:
> On Mon, Nov 09, 2015 at 03:35:31PM -0700, Jason Gunthorpe wrote:
> > On Mon, Nov 09, 2015 at 08:48:36AM +0200, Haggai Eran wrote:
> > > On 08/11/2015 17:04, Matan Barak wrote:
> > > >> @@ -704,6 +719,10 @@ static ssize_t ib_uverbs_write(struct file *filp, 
> > > >> const char __user *buf,
> > > >> > }
> > > >> >
> > > >> > command = hdr.command & IB_USER_VERBS_CMD_COMMAND_MASK;
> > > >> > +   if (verify_command_mask(ib_dev, command)) {
> > > >> > +   ret = -EINVAL;
> > > > Why did you replace ENOSYS with EINVAL?
> > > 
> > > I think ENOSYS is meant only to represent a system call number that
> > > isn't supported. There was a change in checkpatch that now warns about
> > > it [1]. I'm not sure the intention was to fix existing uses though.
> > 
> > Within verbs we should have two kinds of return - not supported
> > request, and supported request with invalid parameters.
> > 
> > Maybe use EOPNOTSUPP ?
> > 
> 
> What about Matan's concern regarding legacy code? Maybe we should
> leave ENOSYS as that's how it is all over the IB stack code.
> 
> quote:
> I think it could break old applications:
> err = extended_verb(the_first_extension_we_added);
> if (err == ENOSYS)
>   err = legacy_verb();
> if (err)
> return err;
> 
> Such applications used the first extension (that was added during the
> addition of the extended verb) and when they realized it's not
> supported, they dropped to the legacy verb. This change can now cause
> the return of -EINVAL an early termination with an error.

Since the change is to make the kernel do the above fall back
internally, this specific example doesn't make alot of sense to worry
about. Ie the extended verb won't fail anymore, and if it does the
legacy one won't work anyhow.

But if there is something out there that does care about ENOSYS we
should try to keep it, but don't convert ENOSYS to EINVAL.

Also, when the driver tests the ex flags for support it should be
returning EOPNOTSUPP or such not EINVAL.. Return codes for the ex
stuff could stand a good sanity audit.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/core: use RCU for uverbs id lookup

2015-11-02 Thread Jason Gunthorpe
On Mon, Nov 02, 2015 at 12:13:25PM -0500, Mike Marciniszyn wrote:
> The current implementation gets a spin_lock, and at any scale with
> qib and hfi1 post send, the lock contention grows exponentially
> with the number of QPs.
> 
> idr_find() is RCU compatibile, so read doesn't need the lock.
> 
> Change to use rcu_read_lock() and rcu_read_unlock() in
> __idr_get_uobj().
> 
> kfree_rcu() is used to insure a grace period between the
> idr removal and actual free.

Looks OK to me.

Reviewed-By: Jason Gunthorpe <jguntho...@obsidianresearch.com>

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: shrink struct ib_send_wr V4

2015-11-02 Thread Jason Gunthorpe
On Mon, Nov 02, 2015 at 07:02:04PM -0500, Doug Ledford wrote:

> Because they are *scheduled* for removal.  If I simply didn't care if
> they went away, then I wouldn't screw around with deprecating them or
> tagging them to be removed, I'd just delete them.  Breaking them before
> the scheduled removal time defeats that entire purpose.

You don't see in the compiles, but IIRC ipath is unusable after the
PAT rework.

We've also been ripping out the ULP side of things that would let
amso1100/ehca work in any meningful way with the kernel ULPs.

The reason I lobbied to get rid of them is specifically because they
don't work and maintaining them (ie the driver and the single-use ULP
codepath side) is a huge pain. Keeping them around and keeping them
compiling defeats the entire point. Just delete them, we don't need to
wait for 4.6.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: shrink struct ib_send_wr V4

2015-11-02 Thread Jason Gunthorpe
On Mon, Nov 02, 2015 at 07:52:05PM -0500, Doug Ledford wrote:
> It shouldn't be.  I reviewed those changes and they looked right (given
> the limitations).  All you needed was to boot with nopat on the kernel
> command line to get the old kernel behavior and it would continue to
> work as before, and it would print out a message telling you to do so if
> you hadn't already.

Alright, that was done after it was pretty clear the driver was
useless and I stopped looking at that part of the series. I don't know
why we made Luis jump through such hoops instead of just deleting it
right then and there.

> > For amso1100, kernel ULPs has never been its target.  Didn't we only
> > recently got any support for iWARP in iSER?

I don't like that reasoning. In 4.4 we expect some ULPs to work on
iWarp, and amso110 can't do it. That is a big reason why the driver
is getting chopped.

> > The reason I lobbied to get rid of them is specifically because they
> > don't work and maintaining them (ie the driver and the single-use ULP
> > codepath side) is a huge pain. Keeping them around and keeping them
> > compiling defeats the entire point. Just delete them, we don't need to
> > wait for 4.6.
> 
> That's not true.  User space continues to work, and amso1100 shouldn't
> be greatly negatively impacted by recent changes.  Nor should ipath.

So what if user space works? The kernel consumers are known broken.

We've commited to removing the driver from the kernel. We have support
of the driver authors to do this. We have found no users or
testers. We've committed to removing the ULP support (ie MR-only code
is being ripped out).

*WHY* spend an ounce of time fixing up code that *NO-ONE* will ever
even run? Wasted effort. Just delete them now.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   3   4   5   6   7   8   9   10   >