Re: [ofa-general] Cannot export multiple directories using nfs-rdma
Jeff Johnson wrote:
> I have an nfs-rdma configuration using Mellanox ConnectX-DDR and OFED 1.4.2 on CentOS 5.3 x86_64. My ConnectX cards are running 2.5.0 firmware, as I have read that 2.6.0 had RDMA issues; I saw those issues and down-rev'd the cards to 2.5.0. I am seeing a peculiar behavior where, if I export two separate directories from the server and attempt to mount them separately from a client, I end up with the same export mounted at two different client directories.
>
> e.g., the server exports /raid1 and /raid2:
>
>   'mount.rnfs 10.0.0.251:/raid1 /raid1 -i -o rdma,port=2050'
>   client:/raid1 --- has server:/raid1 contents
>
>   'mount.rnfs 10.0.0.251:/raid2 /raid2 -i -o rdma,port=2050'
>   client:/raid2 --- has server:/raid1 contents
>
> I have tried creating multiple RDMA ports on the server (2050 and 2051) and then using a different port for each separate mount. The result is the same. I have verified that I am indeed mounting RDMA and not merely IPoIB.
>
> Is nfs-rdma capable of multiple exports? If so, I cannot find a method for dealing with multiple exports from the server or client side in any OFED docs. Thanks for any assistance.
>
> --
> Jeff Johnson
> Manager
> Aeon Computing
> jeff.john...@aeoncomputing.com
> t: 858-412-3810  f: 858-412-3845  m: 619-204-9061
> 4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117

Hi Jeff:

The mount service does not run over RDMA; it only runs over TCP/UDP. I believe you should be able to reproduce this behavior on plain old GigE/IPoIB. Is this the case?

Tom

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
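For reference, a two-export server setup is normally described in /etc/exports. The fragment below is a hypothetical sketch (the paths, client network, and options are assumptions, not taken from this thread); giving each export a distinct fsid is one common way to make sure the server hands out distinct file handles per directory, which is worth ruling out when two mounts show identical contents:

```
# /etc/exports -- hypothetical sketch, not taken from this thread
/raid1  10.0.0.0/24(rw,no_root_squash,insecure,fsid=1)
/raid2  10.0.0.0/24(rw,no_root_squash,insecure,fsid=2)
```

After editing, `exportfs -ra` makes the server re-read the export table; the client-side mount.rnfs commands from the thread are unchanged.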
[ofa-general] Re: NFSRDMA connectathon prelim. testing status,
Vu: What memory registration model are you using?

Vu Pham wrote:
> Hi Tom,
>
> I have both the nfsrdma client and server on a 2.6.29-rc5 kernel with nfs-utils-1.1.4. I'm using both InfiniHost III (ib_mthca) and ConnectX (mlx4_ib) HCAs. I saw several problems during my testing at NFS Connectathon 2009:
>
> 1. When I used ConnectX (mlx4_ib) HCAs on both client and server, the client cannot mount. Talking to Tom Talpey and scanning the code, I saw that the xprtrdma module is using ib_reg_phys_mr(), and the mlx4_ib verbs provider does not implement this verb. If I have the client on mlx4_ib and the server on ib_mthca, I hit the following crash because of bad error handling in xprtrdma (see the attached file mlx4_mount_problem.log). Because of this problem, I used InfiniHost III (ib_mthca) for all of my tests at Connectathon.
>
> 2. Testing the Linux nfsrdma client against both the Linux and OpenSolaris nfsrdma servers, I hit a hung-process problem during the Connectathon lock test (see the attached files sync_page_1.log and sync_page_2.log). I can only reproduce it when I run Connectathon for more than 500 iterations (-N 1000). I can NOT reproduce the problem with the nfs client/server over IPoIB.
>
> 3. Testing the OpenSolaris nfsrdma client against the Linux nfsrdma server, I hit the following BUG_ON() right away (see the attached file svcrdma_send.log).
>
> thanks,
> -vu
Re: [Fwd: RE: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?]
Jeff:

Unfortunately, the NFSRDMA transport cannot make your disks go faster. If the storage subsystem is incapable of keeping up with IPoIB, then it won't be able to keep up with NFSRDMA either. To compare NFSRDMA and IPoIB performance absent a very fast storage subsystem, you'll need to keep the file sizes small enough that they fit within the server cache.

Tom

Jeff Becker wrote:
> Hi. Just passing this on in case you missed it. Do you have any advice on what knobs to tweak to get better performance (than NFS/IPoIB)? Thanks.
>
> -jeff
>
> -------- Original Message --------
> Subject: RE: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?
> Date: Mon, 10 Nov 2008 16:27:50 +
> From: Ciesielski, Frederic (EMEA HPCOSLO CC) [EMAIL PROTECTED]
> To: Jeff Becker [EMAIL PROTECTED]
> CC: general@lists.openfabrics.org
>
> That's great, thanks.
>
> I ran some tests with the 2.6.27 kernel as server and client, and basically it works fine. I could not yet find any situation where NFS-RDMA would outperform NFS/IPoIB, at least when you compare apples to apples (same clients, same server, same protocol, and not just writing to/reading from the caches), and it even seems to have severe performance issues when reading files larger than the memory size of the client and the server. Hopefully this will improve when more users are able to give valuable feedback...
>
> Fred.
>
> -----Original Message-----
> From: Jeff Becker [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, 08 November, 2008 22:35
> To: Ciesielski, Frederic (EMEA HPCOSLO CC)
> Cc: general@lists.openfabrics.org
> Subject: Re: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?
>
> Ciesielski, Frederic (EMEA HPCOSLO CC) wrote:
>> Is there any chance that the new NFS-RDMA features coming with OFED 1.4 work with standard and current distributions, like RHEL5, SLES10?
>
> Not yet, but I'm working on it. I intend for NFSRDMA to work on 2.6.27 and 2.6.26 for OFED 1.4. The RHEL5 and SLES10 backports will likely be done for OFED 1.4.1. Thanks.
>
> -jeff
>
>> Did anybody test this, or would pretend it is supposed to work? I mean without building a 2.6.27 or equivalent kernel on top of it, keeping almost full support from the vendors. Enhanced kernel modules may not be sufficient to work around the limitations of old kernels...
[ofa-general] [PATCH 00/03] RDMA Transport Support for 9P
Roland:

This patchset implements an RDMA transport provider for v9fs (the Plan 9 filesystem). Could you take a look at it and let us know what you think?

Thanks,
Tom

Here is the original posting...

Eric:

This patch series implements an RDMA transport provider for 9P and is relative to your for-next branch. The RDMA support is built on the OpenFabrics API and uses SEND and RECV to exchange data. This patch series has been tested with dbench and iozone.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
Signed-off-by: Latchesar Ionkov [EMAIL PROTECTED]

[PATCH 01/03] 9prdma: RDMA Transport Support for 9P
 net/9p/trans_rdma.c |  996 +++
 1 files changed, 996 insertions(+), 0 deletions(-)

[PATCH 02/03] 9prdma: Makefile change for the RDMA transport
 net/9p/Makefile |    4
 1 files changed, 4 insertions(+), 0 deletions(-)

[PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport
 net/9p/Kconfig |    6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)
[ofa-general] [PATCH 02/03] 9prdma: Makefile change for the RDMA transport
This adds a make rule for the 9pnet_rdma module that implements the RDMA transport.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
Signed-off-by: Latchesar Ionkov [EMAIL PROTECTED]
---
 net/9p/Makefile |    4
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/9p/Makefile b/net/9p/Makefile
index 5192194..bc909ab 100644
--- a/net/9p/Makefile
+++ b/net/9p/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_NET_9P) := 9pnet.o
 obj-$(CONFIG_NET_9P_VIRTIO) += 9pnet_virtio.o
+obj-$(CONFIG_NET_9P_RDMA) += 9pnet_rdma.o

 9pnet-objs := \
 	mod.o \
@@ -12,3 +13,6 @@ obj-$(CONFIG_NET_9P_VIRTIO) += 9pnet_virtio.o
 9pnet_virtio-objs := \
 	trans_virtio.o \
+
+9pnet_rdma-objs := \
+	trans_rdma.o \
[ofa-general] [PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport
This patch adds a config option for the 9P RDMA transport.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
Signed-off-by: Latchesar Ionkov [EMAIL PROTECTED]
---
 net/9p/Kconfig |    6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/9p/Kconfig b/net/9p/Kconfig
index ff34c5a..c42c0c4 100644
--- a/net/9p/Kconfig
+++ b/net/9p/Kconfig
@@ -20,6 +20,12 @@ config NET_9P_VIRTIO
 	  This builds support for a transports between
 	  guest partitions and a host partition.

+config NET_9P_RDMA
+	depends on NET_9P && INFINIBAND && EXPERIMENTAL
+	tristate "9P RDMA Transport (Experimental)"
+	help
+	  This builds support for a RDMA transport.
+
 config NET_9P_DEBUG
 	bool "Debug information"
 	depends on NET_9P
[ofa-general] [PATCH 01/03] 9prdma: RDMA Transport Support for 9P
This file implements the RDMA transport provider for 9P. It allows mounts to be performed over iWARP- and IB-capable network interfaces and uses the OpenFabrics API to perform I/O.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
Signed-off-by: Latchesar Ionkov [EMAIL PROTECTED]
---
 net/9p/trans_rdma.c | 1025 +++
 1 files changed, 1025 insertions(+), 0 deletions(-)

diff --git a/net/9p/trans_rdma.c b/net/9p/trans_rdma.c
new file mode 100644
index 000..f919768
--- /dev/null
+++ b/net/9p/trans_rdma.c
@@ -0,0 +1,1025 @@
+/*
+ * linux/fs/9p/trans_rdma.c
+ *
+ * RDMA transport layer based on the trans_fd.c implementation.
+ *
+ * Copyright (C) 2008 by Tom Tucker [EMAIL PROTECTED]
+ * Copyright (C) 2006 by Russ Cox [EMAIL PROTECTED]
+ * Copyright (C) 2004-2005 by Latchesar Ionkov [EMAIL PROTECTED]
+ * Copyright (C) 2004-2008 by Eric Van Hensbergen [EMAIL PROTECTED]
+ * Copyright (C) 1997-2002 by Ron Minnich [EMAIL PROTECTED]
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to:
+ *   Free Software Foundation
+ *   51 Franklin Street, Fifth Floor
+ *   Boston, MA  02111-1301  USA
+ *
+ */
+
+#include <linux/in.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <linux/ipv6.h>
+#include <linux/kthread.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/un.h>
+#include <linux/uaccess.h>
+#include <linux/inet.h>
+#include <linux/idr.h>
+#include <linux/file.h>
+#include <linux/parser.h>
+#include <net/9p/9p.h>
+#include <net/9p/transport.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/rdma_cm.h>
+
+#define P9_PORT			5640
+#define P9_RDMA_SQ_DEPTH	32
+#define P9_RDMA_RQ_DEPTH	32
+#define P9_RDMA_SEND_SGE	4
+#define P9_RDMA_RECV_SGE	4
+#define P9_RDMA_IRD		0
+#define P9_RDMA_ORD		0
+#define P9_RDMA_TIMEOUT		30000		/* 30 seconds */
+#define P9_RDMA_MAXSIZE		(4*4096)	/* Min SGE is 4, so we can
+						 * safely advertise a maxsize
+						 * of 64k */
+
+#define P9_RDMA_MAX_SGE (P9_RDMA_MAXSIZE >> PAGE_SHIFT)
+
+/**
+ * struct p9_trans_rdma - RDMA transport instance
+ *
+ * @state: tracks the transport state machine for connection setup and tear down
+ * @cm_id: The RDMA CM ID
+ * @pd: Protection Domain pointer
+ * @qp: Queue Pair pointer
+ * @cq: Completion Queue pointer
+ * @lkey: The local access only memory region key
+ * @next_tag: The next tag for tracking rpc
+ * @timeout: Number of uSecs to wait for connection management events
+ * @sq_depth: The depth of the Send Queue
+ * @sq_count: Number of WR on the Send Queue
+ * @rq_depth: The depth of the Receive Queue. NB: I _think_ that 9P is
+ * purely req/rpl (i.e. no unaffiliated replies), but I'm not sure, so
+ * I'm allowing this to be tweaked separately.
+ * @addr: The remote peer's address
+ * @req_lock: Protects the active request list
+ * @req_list: List of sent RPC awaiting replies
+ * @send_wait: Wait list when the SQ fills up
+ * @cm_done: Completion event for connection management tracking
+ */
+struct p9_trans_rdma {
+	enum {
+		P9_RDMA_INIT,
+		P9_RDMA_ADDR_RESOLVED,
+		P9_RDMA_ROUTE_RESOLVED,
+		P9_RDMA_CONNECTED,
+		P9_RDMA_FLUSHING,
+		P9_RDMA_CLOSING,
+		P9_RDMA_CLOSED,
+	} state;
+	struct rdma_cm_id *cm_id;
+	struct ib_pd *pd;
+	struct ib_qp *qp;
+	struct ib_cq *cq;
+	struct ib_mr *dma_mr;
+	u32 lkey;
+	atomic_t next_tag;
+	long timeout;
+	int sq_depth;
+	atomic_t sq_count;
+	int rq_depth;
+	struct sockaddr_in addr;
+
+	spinlock_t req_lock;
+	struct list_head req_list;
+
+	wait_queue_head_t send_wait;
+	struct completion cm_done;
+	struct p9_idpool *tagpool;
+};
+
+/**
+ * p9_rdma_context - Keeps track of in-process WR
+ *
+ * @wc_op: Mellanox's broken HW doesn't provide the original WR op
+ * when the CQE completes in error. This forces apps to keep track of
+ * the op themselves. Yes, it's a Pet Peeve of mine ;-)
+ * @busa: Bus address to unmap when the WR completes
+ * @req: Keeps track of requests (send)
+ * @rcall: Keeps track of replies (receive)
+ */
+struct p9_rdma_req;
+struct p9_rdma_context {
+	enum ib_wc_opcode wc_op;
+	dma_addr_t busa;
+	union
Re: [ofa-general] [PATCH 00/03] RDMA Transport Support for 9P
Roland Dreier wrote:
>> This patchset implements an RDMA transport provider for the v9fs (Plan 9 filesystem). Could you take a look at it and let us know what you think?
>
> I sent comments on the initial posting I saw on lkml ... did they not make it to you?

No, I just missed it. Sorry. I just responded to your comments.

> [PATCH 01/03] 9prdma: RDMA Transport Support for 9P
> [PATCH 02/03] 9prdma: Makefile change for the RDMA transport
> [PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport
>
> One meta-comment I didn't send last time: the patches are small enough that I would just send it all in one patch, since it makes sense to apply it that way anyway.

Ok, makes my life easy.

> - R.
[ofa-general] Fast Reg Question
Roland:

I'm a little perplexed by the fast reg WR definition. The context is that I'm using the Fast Reg verb to map the local memory that is the data source for an RDMA_WRITE. The WR format, however, only takes an rkey. How does this all work when you're using fast reg to map local memory? Does the WR really need the mr pointer, or both the lkey and rkey? The IBTA spec seems to indicate that it needs more information about the MR than just the rkey.

Tom
Re: [ofa-general] Fast Reg Question
Roland Dreier wrote:
>> I'm a little perplexed by the fast reg WR definition. The context is that I'm using the Fast Reg verb to map the local memory that is the data source for an RDMA_WRITE. The WR format, however, only takes an rkey. How does this all work when you're using fast reg to map local memory? Does the WR really need the mr pointer, or both the lkey and rkey? The IBTA spec seems to indicate that it needs more information about the MR than just the rkey.
>
> On Mellanox, L_Key and R_Key are always the same,

Also true for iWARP.

> so it doesn't really matter. I think in general the idea would be that the L_Key you have gets updated with any consumer key changes you make in the WR but otherwise works the same.

Fair enough. Use the mr->lkey value in the SGE for subsequent DTO.

> The WR processing had better be able to find the MR by R_Key, so I think it's OK.

It just seems a little weird to be supplying the R_Key when you're mapping local memory. I'll look at the IB spec though. The spec refers to a bunch of verification on the L_Key. Obviously, if the L_Key and R_Key are the same, the distinction is moot.

> - R.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
On Mon, 2008-05-26 at 08:07 -0500, Steve Wise wrote:
> Roland Dreier wrote:
>>> - device-specific alloc/free of physical buffer lists for use in fast register work requests. This allows devices to allocate this memory as needed (like via dma_alloc_coherent).
>>
>> I'm looking at how one would implement the MM extensions for mlx4, and it turns out that in addition to needing to allocate these fastreg page lists in coherent memory, mlx4 is even going to need to write to the memory (basically set the lsb of each address for internal device reasons). So I think we just need to update the documentation of the interface so that not only does the page list belong to the device driver between posting the fastreg work request and completing the request, but also the device driver is allowed to change the page list as part of the work request processing. I don't see any real reason why this would cause problems for consumers; does this seem OK to other people?
>
> Tom, does this affect how you plan to implement NFSRDMA MEM_MGT_EXTENSIONS support?

I think this is ok.
Re: [ofa-general] device attributes
On Wed, 2008-06-04 at 09:28 -0500, Steve Wise wrote:
> Roland/All,
>
> Should the device attributes (for instance max_send_wr) be the max supported by the HW, the max supported by the OS, or something else?

Something else.

> For instance: Chelsio's HW can handle very large work queues, but since Linux limits the size of contiguous dma coherent memory allocations, the actual limits are much smaller.

Basing the limit on an OS resource seems arbitrary and dangerous. Applications using advertised adapter resource limits will unnecessarily consume the maximum.

> Which should I be using for the device attributes?

Arbitrary knee jerk == 512. However, surveying current app usage as well as the other manufacturers' advertised limits will make it less arbitrary.

> Also, the chelsio device uses a single work queue to implement the SQ and RQ abstractions. So the max SQ depth depends on the RQ depth and vice versa. This leads to device max attributes that aren't that useful.

So the real limit is the HW WQ max, and therefore max SQ = HW WQ max - RQ max? Setting RQ and SQ to 512 solves this problem.

> I'm wondering what application writers should glean from these attributes...

Here's what I suggest:

- Set the RQ and SQ max to some reasonable default limit (e.g. 512).
- Add an escape hatch by providing module options to override the default max.

Tom

> Steve
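The suggestion above -- advertise a reasonable fixed default with an override knob, clamped against the shared hardware work queue -- can be sketched in plain C. All of the names and numbers here are hypothetical, not taken from any real driver; in a kernel module, depth_override would be a module parameter:

```c
/* Hypothetical numbers for the sketch -- not any real device's limits. */
#define HW_WQ_MAX     16384  /* one HW work queue shared by SQ and RQ */
#define DEFAULT_DEPTH 512    /* the "arbitrary knee jerk" default */

/* Stand-in for a module parameter (0 means "use the default"). */
static int depth_override = 0;

/* Depth to advertise in max_send_wr / max_recv_wr: the default (or the
 * override), clamped so that an SQ and an RQ both at the advertised
 * depth still fit in the shared hardware work queue. */
static int advertised_depth(void)
{
	int depth = depth_override ? depth_override : DEFAULT_DEPTH;

	if (2 * depth > HW_WQ_MAX)
		depth = HW_WQ_MAX / 2;
	return depth;
}
```

With these made-up numbers, the advertised depth is 512 by default, and an over-large override (say 65536) gets clamped to 8192 so an SQ and RQ at the advertised depth still fit in the shared WQ.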
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
On Tue, 2008-05-27 at 10:33 -0500, Tom Tucker wrote:
> On Mon, 2008-05-26 at 16:02 -0700, Roland Dreier wrote:
>>> The invalidate-local-stag part of a read is just a local, sink-side operation (i.e. no wire protocol change from a read). It's not like processing an ingress send-with-inv. It is really functionally like a read followed immediately by a fenced invalidate-local, but it doesn't stall the pipe. So the device has to remember that the read is with-inv-local-stag and invalidate the stag after the read response is placed and before the WCE is reaped by the application.
>>
>> Yes, understood. My point was just that in IB, at least in theory, one could just use an L_Key that doesn't have any remote permissions in the scatter list of an RDMA read, while in iWARP the STag used to place an RDMA read response has to have remote write permission. So RDMA read with invalidate makes sense for iWARP, because it gives a race-free way to allow an STag to be invalidated immediately after an RDMA read response is placed, while in IB it's simpler just to never give remote access at all.
>>
>> - R.
>
> So I think from an NFSRDMA coding perspective it's a wash...
>
> When creating the local data sink, we need to check the transport type: if it's IB -- local access only; if it's iWARP -- local + remote access.
>
> When posting the WR, we check the fastreg capabilities bit + transport type bit:
>
> 	if fastreg is true
> 		post FastReg
> 		if iWARP (or with a cap-bit read-with-inv flag)
> 			post rdma read w/ invalidate
> 		else /* IB */
> 			post rdma read
> 			post invalidate
> 		fi
> 	else
> 		... today's logic
> 	fi
>
> I make the observation, however, that the transport type is now overloaded with a set of required verbs. For iWARP's case, this means rdma-read-w-inv, plus rdma-send-w-inv, etc... This also means that new transport types will inherit one or the other set of verbs (IB or iWARP).
>
> Tom

Steve pointed out a good optimization here. Instead of fencing the RDMA READ in advance of the INVALIDATE, we should post the INVALIDATE when the READ WR completes. This will avoid stalling the SQ. Since IB doesn't put the LKEY on the wire, there's no security issue to close. We need to keep a bunch of fastreg MRs around anyway for concurrent RPC. Thoughts?

Tom
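The branch logic quoted above can be captured as a small, transport-agnostic decision table. This is only a model of the decision for illustration -- the function and strings are hypothetical, not the xprtrdma code:

```c
/* Model of the WR-posting decision described above (hypothetical names). */
enum xprt_type { XPRT_IB, XPRT_IWARP };

static const char *read_post_plan(enum xprt_type t, int has_fastreg,
				  int has_read_with_inv)
{
	if (!has_fastreg)
		return "today's logic";	/* legacy registration path */

	/* Every fastreg-capable case posts the FastReg first, then: */
	if (t == XPRT_IWARP || has_read_with_inv)
		return "fastreg + rdma-read-with-invalidate";

	/* IB: no remote access was granted, so invalidate lazily when the
	 * READ completes instead of posting a fenced invalidate. */
	return "fastreg + rdma-read, invalidate on READ completion";
}
```

The third branch reflects the optimization discussed in the thread: on IB, the INVALIDATE is posted when the READ completes rather than fenced behind it, so the SQ never stalls.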
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
On Tue, 2008-05-27 at 12:39 -0400, Talpey, Thomas wrote:
> At 11:33 AM 5/27/2008, Tom Tucker wrote:
>> So I think from an NFSRDMA coding perspective it's a wash...
>
> Just to be clear, you're talking about the NFS/RDMA server. However, it's pretty much a wash on the client, for different reasons.

Tom: What client-side memory registration strategy do you recommend if the default on the server side is fastreg? On the performance side we are limited by the minimum size of the read/write-chunk element. If the client still gives the server a 4k chunk, the performance benefit (fewer PDUs on the wire) goes away.

Tom

>> When posting the WR, we check the fastreg capabilities bit + transport type bit: if fastreg is true -- post FastReg; if iWARP (or with a cap-bit read-with-inv flag) post rdma read w/ invalidate ... For iWARP's case, this means rdma-read-w-inv, plus rdma-send-w-inv, etc...
>
> Maybe I'm confused, but I don't understand this. iWARP RDMA Read requests don't support remote invalidate. At least, the table in RFC 5040 (p. 22) doesn't:
>
> -------+------------+--------+-------+--------+------------+--------------
> RDMA   | Message    | Tagged | STag  | Queue  | Invalidate | Message
> Message| Type       | Flag   | and   | Number | STag       | Length
> OpCode |            |        | TO    |        |            | Communicated
>        |            |        |       |        |            | between DDP
>        |            |        |       |        |            | and RDMAP
> -------+------------+--------+-------+--------+------------+--------------
> 0000b  | RDMA Write |   1    | Valid |  N/A   |    N/A     | Yes
> -------+------------+--------+-------+--------+------------+--------------
> 0001b  | RDMA Read  |   0    | N/A   |   1    |    N/A     | Yes
>        | Request    |        |       |        |            |
> -------+------------+--------+-------+--------+------------+--------------
> 0010b  | RDMA Read  |   1    | Valid |  N/A   |    N/A     | Yes
>        | Response   |        |       |        |            |
> -------+------------+--------+-------+--------+------------+--------------
> 0011b  | Send       |   0    | N/A   |   0    |    N/A     | Yes
> -------+------------+--------+-------+--------+------------+--------------
> 0100b  | Send with  |   0    | N/A   |   0    |   Valid    | Yes
>        | Invalidate |        |       |        |            |
> -------+------------+--------+-------+--------+------------+--------------
> 0101b  | Send with  |   0    | N/A   |   0    |    N/A     | Yes
>        | SE         |        |       |        |            |
> -------+------------+--------+-------+--------+------------+--------------
> 0110b  | Send with  |   0    | N/A   |   0    |   Valid    | Yes
>        | SE and     |        |       |        |            |
>        | Invalidate |        |       |        |            |
> -------+------------+--------+-------+--------+------------+--------------
> 0111b  | Terminate  |   0    | N/A   |   2    |    N/A     | Yes
> -------+------------+--------+-------+--------+------------+--------------
> 1000b  |            |
> to     | Reserved   | Not Specified
> 1111b  |            |
> -------+------------+---------------
>
> I want to take this opportunity to also mention that the RPC/RDMA client-server exchange does not support remote-invalidate currently. Because of the multiple stags supported by the rpcrdma chunking header, and because the client needs to verify that the stags were in fact invalidated, there is significant overhead, and the jury is out on that benefit. In fact, I suspect it's a loss at the client.
>
> Tom (Talpey).
[ofa-general] Re: [PATCH] AMSO1100: Add check for NULL reply_msg in c2_intr
On Fri, 2008-04-04 at 12:35 -0700, Roland Dreier wrote:
>> I'm up to my eyeballs right now. If it's ok with you, I'd say defer the refactoring.
>
> No problem, I'll queue this up, and if you ever get time to work on amso1100 you can send the refactoring. But are you working on a pmtu fix?

Steve and I will noodle on what to do here and post something.

> - R.
[ofa-general] [PATCH] AMSO1100: Add check for NULL reply_msg in c2_intr
AMSO1100: Add check for NULL reply_msg in c2_intr

This is a checker-found bug posted to bugzilla.kernel.org (7478). Upon inspection I also found a place where we could attempt to kmem_cache_free a NULL pointer.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---

Roland,

I don't think anyone has ever hit this bug, so it is a low priority in my view. I also noticed that if we refactored vq_wait_for_reply we could combine a common

	if (!reply) {
		err = -ENOMEM;
		goto bail;
	}

construct by guaranteeing that reply is non-NULL if vq_wait_for_reply returns without an error. This patch, however, is much smaller. What do you think?

 drivers/infiniband/hw/amso1100/c2_cq.c   |    4 ++--
 drivers/infiniband/hw/amso1100/c2_intr.c |    6 +-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c
index d2b3366..bb17cce 100644
--- a/drivers/infiniband/hw/amso1100/c2_cq.c
+++ b/drivers/infiniband/hw/amso1100/c2_cq.c
@@ -422,8 +422,8 @@ void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq)
 		goto bail1;
 	reply = (struct c2wr_cq_destroy_rep *) (unsigned long) (vq_req->reply_msg);
-
-	vq_repbuf_free(c2dev, reply);
+	if (reply)
+		vq_repbuf_free(c2dev, reply);
 bail1:
 	vq_req_free(c2dev, vq_req);
 bail0:
diff --git a/drivers/infiniband/hw/amso1100/c2_intr.c b/drivers/infiniband/hw/amso1100/c2_intr.c
index 0d0bc33..3b50954 100644
--- a/drivers/infiniband/hw/amso1100/c2_intr.c
+++ b/drivers/infiniband/hw/amso1100/c2_intr.c
@@ -174,7 +174,11 @@ static void handle_vq(struct c2_dev *c2dev, u32 mq_index)
 		return;
 	}
 
-	err = c2_errno(reply_msg);
+	if (reply_msg)
+		err = c2_errno(reply_msg);
+	else
+		err = -ENOMEM;
+
 	if (!err)
 		switch (req->event) {
 		case IW_CM_EVENT_ESTABLISHED:
 			c2_set_qp_state(req->qp,
[ofa-general] Re: [PATCH] AMSO1100: Add check for NULL reply_msg in c2_intr
On Fri, 2008-04-04 at 12:22 -0700, Roland Dreier wrote:
>> I don't think anyone has ever hit this bug, so it is a low priority in my view. I also noticed that if we refactored vq_wait_for_reply we could combine a common if (!reply) { err = -ENOMEM; goto bail; } construct by guaranteeing that reply is non-NULL if vq_wait_for_reply returns without an error. This patch, however, is much smaller. What do you think?
>
> Well, now is a good time to merge either version of the fix. It would be nice to kill off one of the Coverity issues, so I'm happy to take this. It's up to you how much effort you want to spend on this... the refactoring sounds nice, but I think we're OK without it.

I'm up to my eyeballs right now. If it's ok with you, I'd say defer the refactoring.

> - R.
Re: [nfs-rdma-devel] [ofa-general] Status of NFS-RDMA ? (fwd)
On Fri, 2008-02-29 at 09:29 +0100, Sebastian Schmitzdorff wrote:
> hi pawel,
>
> I was wondering if you have achieved better nfs rdma benchmark results by now?

Pawel: What is your network hardware setup?

Thanks,
Tom

> regards
> Sebastian
>
> Pawel Dziekonski schrieb:
>> hi,
>>
>> the saga continues. ;) very basic benchmarks and surprising (at least for me) results - it looks like reading is much slower than writing, and NFS/RDMA is twice as slow in reading as classic NFS. :o results below - comments appreciated!
>>
>> regards, Pawel
>>
>> Both nfs server and client have 8 cores, 16 GB RAM, and Mellanox DDR HCAs (MT25204) connected port-to-port (no switch).
>>
>>   local_hdd  - 2 sata2 disks in soft-raid0
>>   nfs_ipoeth - classic nfs over ethernet
>>   nfs_ipoib  - classic nfs over IPoIB
>>   nfs_rdma   - NFS/RDMA
>>
>> Simple write of a 36GB file with dd (both machines have 16GB RAM):
>>
>>   /usr/bin/time -p dd if=/dev/zero of=/mnt/qqq bs=1M count=36000
>>
>>   local_hdd    sys 54.52  user 0.04  real 254.59
>>   nfs_ipoib    sys 36.35  user 0.00  real 266.63
>>   nfs_rdma     sys 39.03  user 0.02  real 323.77
>>   nfs_ipoeth   sys 34.21  user 0.01  real 375.24
>>
>> Remount /mnt to clear the cache, then read the file back from the nfs share and write it to local disk:
>>
>>   /usr/bin/time -p dd if=/mnt/qqq of=/scratch/qqq bs=1M
>>
>>   nfs_ipoib    sys 59.04  user 0.02  real 571.57
>>   nfs_ipoeth   sys 58.92  user 0.02  real 606.61
>>   nfs_rdma     sys 62.57  user 0.03  real 1296.36
>>
>> Results from bonnie++ (the Per-Chr columns were empty and are omitted):
>>
>> Version 1.03c         --Sequential Write--  -Rewrite-  --Sequential Read--  --Random-
>>                           --Block--                        --Block--        --Seeks--
>> Machine    Size          K/sec  %CP        K/sec %CP      K/sec  %CP         /sec %CP
>> local_hdd  35G:128k      93353   12        58329   6     143293    7        243.6   1
>> local_hdd  35G:256k      92283   11        58189   6     144202    8        172.2   2
>> local_hdd  35G:512k      93879   12        57715   6     144167    8        128.2   4
>> local_hdd  35G:1024k     93075   12        58637   6     144172    8         95.3   7
>> nfs_ipoeth 35G:128k      91325    7        31848   4      64299    4        170.2   1
>> nfs_ipoeth 35G:256k      90668    7        32036   5      64542    4        163.2   2
>> nfs_ipoeth 35G:512k      93348    7        31757   5      64454    4         85.7   3
>> nfs_ipoeth 35G:1024k     91283    7        31869   5      64241    5         51.7   4
>> nfs_ipoib  35G:128k      91733    7        36641   5      65839    4        178.4   2
>> nfs_ipoib  35G:256k      92453    7        36567   6      66682    4        166.9   3
>> nfs_ipoib  35G:512k      91157    7        37660   6      66318    4         86.8   3
>> nfs_ipoib  35G:1024k     92111    7        35786   6      66277    5         53.3   4
>> nfs_rdma   35G:128k      91152    8        29942   5      32147    2        187.0   1
>> nfs_rdma   35G:256k      89772    7        30560   5      34587    2        158.4   3
>> nfs_rdma   35G:512k      91290    7        29698   5      34277    2         60.9   2
>> nfs_rdma   35G:1024k     91336    8        29052   5      31742    2         41.5   3
>>
>>                     --Sequential Create--          --Random Create--
>>                 -Create-- --Read---  -Delete-  -Create-- --Read---  -Delete-
>> files:max:min    /sec %CP  /sec %CP  /sec %CP   /sec %CP  /sec %CP  /sec %CP
>> local_hdd  16   10587  36 +++++ +++  8674  29  10727  35 +++++ +++  7015  28
>> local_hdd  16   11372  41 +++++ +++  8490  29  11192  43 +++++ +++  6881  27
>> local_hdd  16   10789  35 +++++ +++  8520  29  11468  46 +++++ +++  6651  24
>> local_hdd  16   10841  40 +++++ +++  8443  28  11162  41 +++++ +++  6441  22
>> nfs_ipoeth 16    3753   7 13390  12  3795   7   3773   8 22181  16  3635   7
>> nfs_ipoeth 16    3762   8 12358   7  3713   8   3753   7 20448  13  3632   6
>> nfs_ipoeth 16    3834   7 12697   6  3729   8   3725   9 22807  11  3673   7
>> nfs_ipoeth 16    3729   8 14260  10  3774   7   3744   7 25285  14  3688   7
>> nfs_ipoib  16    6803  17 +++++ +++  6843  15   6820  14 +++++ +++  5834  11
>> nfs_ipoib  16    6587  16 +++++ +++  4959   9   6832  14 +++++ +++  5608  12
>> nfs_ipoib  16    6820  18 +++++ +++  6636  15   6479  15 +++++ +++  5679  13
>> nfs_ipoib  16    6475  14 +++++ +++  6435  14   5543  11 +++++ +++  5431  11
>> nfs_rdma   16    7014  15 +++++ +++  6714  10   7001  14 +++++ +++  5683   8
>> nfs_rdma   16    7038  13 +++++ +++  6713  12   6956  11 +++++ +++  5488   8
>> nfs_rdma   16    7058  12 +++++ +++  6797  11   6989  14 +++++ +++  5761   9
>> nfs_rdma   16    7201  13 +++++ +++  6821  12   7072  15 +++++ +++  5609   9
Re: [ofa-general] post_recv question
On 2/22/08 12:09 AM, Roland Dreier [EMAIL PROTECTED] wrote: I think we can assume that the ringing of the doorbell is synchronous, i.e. when the processor completes its write, the card knows there are RQ WQEs available in host memory. It doesn't affect your larger point, but to be pedantically precise, writes across PCI will be posted, so the CPU may fully retire a write to MMIO long before that write completes at its final destination. You're right. In fact, I think up to 4 words for the common implementation. But I think this speaks again to the claim that guarantees between adapters on different busses can't work, because posted writes go to different FIFOs. - R. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] post_recv question
On Thu, 2008-02-21 at 12:22 -0800, Roland Dreier wrote: OpenMPI can be configured to send credit updates over a different QP. I'll try to stress it next week to see what happens. It seems that it would be pretty hard to hit this race in practice. And I don't think mem-free Mellanox hardware has any race -- not positive about Tavor/non-mem-free Arbel. (On IB you need to set RNR retries to 0 also for the missing receive to be detectable even if the race exists) Well, consider the case of two adapters on two different PCI busses. One is busy, one is not. Specifically, the post_recv QP is on an HCA on a busy bus, and the post_send (of the credit) is on a QP on an HCA on a dedicated bus. I think we can assume that the ringing of the doorbell is synchronous, i.e. when the processor completes its write, the card knows there are RQ WQEs available in host memory, but whether and when the WQE is fetched relative to the processor is asynchronous. The card will have to get on the bus again and read host memory. Meanwhile the processor runs off and posts a send of the credit on the other QP, on a different HCA. The peer responds with a send to the data QP. The receiving adapter knows the WQE is there, but it may not have fetched it yet. The crux of the question is whether the adapter MUST fetch the WQE and place the packet, or can simply drop it. If you say it MUST, then you must have enough buffer to handle worst-case delayed placement. If the post guarantee is only within the same QP or affiliated QP (SRQ), then all it must do is ensure that when processing an SQ request AND the associated RQ (SRQ) is empty, it fetches outstanding, unread RQ WQEs prior to processing the SQ WQE. This allows for the post_recv guarantees without the HCA buffering requirements. I seem to recall that the specs say something about ordering and synchronization between unaffiliated QPs and/or between adapters, but the specific reference long ago fell off my LRU list.
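The hazard described above can be modeled abstractly: the doorbell write retires (the receive counts as "posted") before the adapter has actually fetched the WQE, and a credit advertised over a different path can race ahead of that fetch. A toy single-threaded C model, not verbs code and with purely hypothetical names, sketching the behavior Tom argues for (re-check the RQ before declaring "no buffer"):

```c
#include <stdbool.h>

/* Toy model of the race discussed above (all names hypothetical).
 * posted:  RQ WQEs the host has posted (doorbell rung).
 * fetched: RQ WQEs the adapter has actually read from host memory. */
struct rq_model {
    int posted;
    int fetched;
};

/* Host posts a receive: the doorbell write retires immediately,
 * but the adapter has not necessarily fetched the WQE yet. */
static void post_recv(struct rq_model *rq)
{
    rq->posted++;
}

/* A packet arrives.  A compliant adapter must re-check host memory
 * for posted-but-unfetched WQEs before it declares "no buffer
 * available"; only if nothing was posted is this a genuine RNR. */
static bool place_packet(struct rq_model *rq)
{
    if (rq->fetched == 0 && rq->posted > rq->fetched)
        rq->fetched = rq->posted;   /* late fetch of pending WQEs */
    if (rq->fetched == 0)
        return false;               /* genuine RNR: nothing posted */
    rq->fetched--;
    rq->posted--;
    return true;
}
```

In this model, posting the receive and then sending the credit over another QP is safe even if the peer's send arrives before the adapter's WQE fetch, because the adapter re-reads the RQ before failing; an adapter that skipped that re-check would drop the packet exactly as the scenario above describes.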
Tom
Re: [ofa-general] post_recv question
On Thu, 2008-02-21 at 15:48 -0800, Caitlin Bestler wrote: Good example, more detailed comments in-line. On Thu, Feb 21, 2008 at 2:47 PM, Tom Tucker [EMAIL PROTECTED] wrote: On Thu, 2008-02-21 at 12:22 -0800, Roland Dreier wrote: OpenMPI can be configured to send credit updates over a different QP. I'll try to stress it next week to see what happens. It seems that it would be pretty hard to hit this race in practice. And I don't think mem-free Mellanox hardware has any race -- not positive about Tavor/non-mem-free Arbel. (On IB you need to set RNR retries to 0 also for the missing receive to be detectable even if the race exists) Well, consider the case of two adapters on two different PCI busses. One is busy, one is not. Specifically, the post_recv QP is on an HCA on a busy bus, and the post_send (of the credit) is on a QP on an HCA on a dedicated bus. I think we can assume that the ringing of the doorbell is synchronous, i.e. when the processor completes its write, the card knows there are RQ WQEs available in host memory, but whether and when the WQE is fetched relative to the processor is asynchronous. The card will have to get on the bus again and read host memory. Meanwhile the processor runs off and posts a send of the credit on the other QP, on a different HCA. The peer responds with a send to the data QP. The receiving adapter knows the WQE is there, but it may not have fetched it yet. The crux of the question is whether the adapter MUST fetch the WQE and place the packet, or can simply drop it. If you say it MUST, then you must have enough buffer to handle worst-case delayed placement. If the post guarantee is only within the same QP or affiliated QP (SRQ), then all it must do is ensure that when processing an SQ request AND the associated RQ (SRQ) is empty, it fetches outstanding, unread RQ WQEs prior to processing the SQ WQE. This allows for the post_recv guarantees without the HCA buffering requirements. I disagree.
What is required is the adapter MUST NOT take an action based on a buffer not available diagnosis until it is certain that it has considered all WQEs that have been successfully posted by the consumer. Ok. So what does the HW do with the packet while it's pondering its options? It has to put it somewhere. That's my point. You either guarantee that any advertisement of availability can't be issued prior to the buffer being available, or the buffer is synchronously available prior to the advertisement of the credit. Snooping the [s]RQ while processing the SQ is a way of delaying the issuance of a credit before the buffer (spec'd in the WQE) is actually known to the adapter. But this only works in the context of a single HCA. Further, it MUST NOT require a further action by the consumer to guarantee that it notices a posted WQE. Agreed. Particularly in iWARP, the application layer is free to implement Send/Recv credits by *any* mechanism desired (the only requirement is that there is one; you might recall that there were extensive discussions on this point regarding unsolicited messages for iSER). The concept that the application MUST provide SOME form of flow control was accepted only grudgingly. So clearly any more specific mechanisms were not the intent of the drafters. Yes, but I'm not sure there's any confusion there -- I think this discussion is about how credits can be issued. In particular, what does it mean to issue a credit for: - this QP, - another QP on the same HCA, - another QP on a different HCA. So far, it seems the consensus is that all of the above should work. I'm just not convinced the current implementations guarantee this. So if there are still 1000 Recv WQEs in the SRQ, we can allow the adapter a great deal of flexibility in when the 1001st is linked into the data structures. The only real constraint is that it MUST do 1001 successful allocations *before* it triggers any sort of buffer not available error. Agreed.
I'm not recalling the specific language immediately, but I do recall concluding that sub-dividing the SRQ on an RSS-like basis was *not* compliant with the RDMAC specs, and that the left half of the adapter could not declare buffer not found while the right half of the adapter still had a free buffer. Agreed. This is of course a major pain if you are trying to team two RDMA adapters to form a single virtual adapter, or even two largely independent ports on the same physical adapter. But the intent of the specifications is very clear: if the consumer has posted 1000 recv WQEs and gotten SUCCESS for each of them, then the adapter MUST allocate all 1000 recv WQEs *before* it can fail an operation because no buffer was available. Agreed. So there is a difference between must be pushed to the adapter now and must be pushed to the adapter before it is too late. Yes. Tom
Re: [ofa-general] iommu dma mapping alignment requirements
On Thu, 2007-12-20 at 11:14 -0600, Steve Wise wrote: Hey Roland (and any iommu/ppc/dma experts out there): I'm debugging a data corruption issue that happens on PPC64 systems running rdma on kernels where the iommu page size is 4KB yet the host page size is 64KB. This feature was added to the PPC64 code recently, and is in kernel.org from 2.6.23. So if the kernel is built with a 4KB page size, no problems. If the kernel is prior to 2.6.23, then 64KB page configs work too. It's only a problem when the iommu page size != host page size. It appears that my problem boils down to a single host page of memory that is mapped for dma, where the dma address returned by dma_map_sg() is _not_ 64KB aligned. Here is an example: app registers va 0x2d9a3000 len 12288 ib_umem_get() creates and maps a umem and chunk that look like (dumping state from a registered user memory region): umem len 12288 off 12288 pgsz 65536 shift 16 chunk 0: nmap 1 nents 1 sglist[0] page 0xc0930b08 off 0 len 65536 dma_addr 5bff4000 dma_len 65536 So the kernel maps 1 full page for this MR. But note that the dma address is 5bff4000, which is 4KB aligned, not 64KB aligned. I think this is causing grief to the RDMA HW. My first question is: is there an assumption or requirement in Linux that dma addresses should have the same alignment as the host address they are mapped to? I.e. the rdma core is mapping the entire 64KB page, but the mapping doesn't begin on a 64KB page boundary. If this mapping is considered valid, then perhaps the rdma hw is at fault here. But I'm wondering if this is a PPC/iommu bug. BTW: here is what the Memory Region looks like to the HW: TPT entry: stag idx 0x2e800 key 0xff state VAL type NSMR pdid 0x2 perms RW rem_inv_dis 0 addr_type VATO bind_enable 1 pg_size 65536 qpid 0x0 pbl_addr 0x003c67c0 len 12288 va 2d9a3000 bind_cnt 0 PBL: 5bff4000 Any thoughts? The Ammasso certainly works this way.
If you tell it the page size is 64KB, it will ignore bits in the page address that encode 0-65535. Steve.
Re: [ofa-general] OFA server patching
On Thu, 2007-11-29 at 09:07 -0600, Steve Wise wrote: Jeff Becker wrote: Hi all. In the interest of keeping our server up to date, I applied the latest Ubuntu patches. Several upgrades were made, including git. If you have any problems, let me know. Thanks. -jeff Git seems broken for me. I can no longer use the build_ofa_kernel.sh script. I get this sort of error: fatal: corrupted pack file /home/vlad/scm/ofed_1_2/.git/objects/pack/pack-914d440d906ffa47a30611df81c0597e896040fa.pack Failed executing git I think the version of git you're using is old and doesn't recognize some of the object types in the repository. I saw this same thing when I tried to use a git tree that had remotes created with a newer version of git.
[ofa-general] NFS-RDMA for OFED 1.3
Jeff: There's an updated version of the server transport switch and rdma transport provider available here: git://linux-nfs.org/~tomtucker/nfs-rdma-dev-2.6.git Tom
RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
On Wed, 2007-11-28 at 11:43 -0500, Caitlin Bestler wrote: -Original Message- From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 27, 2007 4:48 PM To: Caitlin Bestler Cc: Kanevsky, Arkady; Glenn Grundstrom; Leonid Grossman; openib- [EMAIL PROTECTED] Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal Caitlin Bestler wrote: On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED] wrote: For the short term, I claim we just implement this as part of linux iwarp connection setup (mandating a 0B read be sent from the active side). Your proposal to add meta-data to the private data requires a standards change anyway and is, IMO, the 2nd phase of this whole enchilada... Steve. I don't see how you can have any solution here that does not require meta-data. For non-peer-to-peer connections neither a zero length RDMA Read or Write should be sent. An extraneous RDMA Read is particularly onerous for a short lived connection that fits the classic active/passive model. So *something* is telling the CMA layer that this connection may need an MPA unjam action. If that isn't meta-data, what is it? I assumed the 0B read would _always_ be sent as part of establishing an iWARP connection using linux and the rdma-cm. That is an extra round-trip per connection setup, which is a significant penalty for a short lived connection. It is trivial for HPC/peer-to-peer applications, but would be a killer for something like HTTP over RDMA. I find it hard to get excited about optimizing short lived connections for RDMA. I simply don't think it's an interesting use case. And btw, HTTP long ago got rid of short lived connections because it's painful even on TCP. Doing something like this for *every* connection makes it effectively a change to the MPA protocol. Uh. No, it doesn't. Normalizing the behavior of applications during connection setup doesn't change the underlying protocol. It adds another one on top. OFA is not the forum for such discussions, the IETF is. 
My living room, the dinner table, the local bar and this mailing list are perfectly acceptable forums for discussing a protocol. The IETF is the forum for standardizing one. Right now, I don't think we're ready to standardize, because we're still exploring the options; the first of which is NOT changing MPA. This group has the unique benefit of actually USING and IMPLEMENTING the protocol and therefore has some beneficial insights that may and should be shared. All that said, revving the MPA protocol is way down the road. OFA drafting an understanding of how peer-to-peer applications use the existing protocol, on the other hand, is quite reasonable. That's step 1, and the 0B READ is one way to do it. But it has to be something done by peer-to-peer middleware or by the verbs layer in response to a flag from the peer-to-peer middleware. Otherwise it is not augmenting a protocol, it is changing it. The flag may be useful; however, I don't see the connection between the flag and complying with the MPA protocol.
Re: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU
Felix Marti wrote: -Original Message- From: Tom Tucker [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 23, 2007 9:32 PM To: Felix Marti Cc: Kanevsky, Arkady; Glenn Grundstrom; Sean Hefty; Steve Wise; Roland Dreier; [EMAIL PROTECTED]; OpenFabrics General Subject: Re: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU Felix Marti wrote: -Original Message- From: [EMAIL PROTECTED] [mailto:general- [EMAIL PROTECTED] On Behalf Of Kanevsky, Arkady Sent: Tuesday, October 23, 2007 6:26 PM To: Glenn Grundstrom; Sean Hefty; Steve Wise Cc: Roland Dreier; [EMAIL PROTECTED]; OpenFabrics General Subject: RE: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU This is still a protocol and should be defined by IETF not OFA. But if we get agreement from all iWARP vendors this will be a good step. [felix] This will not work with a Chelsio RNIC which follows the IETF specification by a) not issuing a 0B RDMA Write to the wire and b) silently consuming an incoming 0B write. Therefore 0B RDMA Writes cannot be 'abused' for such a synchronization mechanism. I believe that the mentioned apps adhering to the iWarp requirement do a 'send' from the active side and only have the passive side issue RDMA ops once the incoming send has been received. I would guess that following a similar model is the best way to go and supported by all iWarp vendors implementing the IETF spec. IMO, the iWARP vendors _must_ get together and work on MPA '2'. Standardizing FPDU 'abuse' might be a good place to start, but it needs to be fixed to support peer-to-peer going forward. In the mean-time, imperfectly hiding the issue in the Firmware, uDAPL, the iWARP CM or anywhere else except the application seems to me to be the only customer friendly solution. 
[felix] While I'm not against trying to hide the connection migration details somewhere below the ULP, I'm not convinced that the issue is as severe as you make it to be, and I would not press to have the issue resolved in a manner that requires a new MPA version. In fact, the different rdma transports (and maybe even different versions of the same transport (in the case of IB)) provide different features, and I would assume that ULPs will eventually code to these features and must thus be aware of the underlying transport protocol. In that bigger picture, the connection migration issue at hand seems fairly trivial to solve even if it requires an ULP change... I didn't make an argument about severity. Qualifying the severity is in the customer's purview. I'm simply pointing out the following: a) the perspective that the restriction is trivial is how we got here, b) making the app change is putting a decision in the customer's hands that IMO an iWARP vendor would rather they didn't have to make ('Do I or don't I support iWARP?'), and c) you have the power to hide this behavior for most cases. Finally, I believe RFC means Request for Comment. Well, here's one last comment -- Add an FPDU message at the end of the MPA exchange and fix the problem in the protocol. If we cannot get agreement on it on the reflector, let's do it at the SC'07 OFA dev. conference. Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -Original Message- From: Glenn Grundstrom [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 23, 2007 9:02 PM To: Sean Hefty; Steve Wise Cc: Roland Dreier; [EMAIL PROTECTED]; OpenFabrics General Subject: RE: [ofa-general] [RFP] support for iWARP requirement - activeconnect side MUST send first FPDU That is what I've been trying to push. Both MVAPICH2 and OMPI have been open to adjusting their transports to adhere to this requirement.
I wouldn't mind implementing something to enforce this in the IWCM or the iWARP drivers IF there was a clean way to do it. So far there hasn't been a clean way proposed. Why can't either uDAPL or iW CM always do a send from the active to passive side that gets stripped off? From the active side, the first send is always posted before any user sends, and if necessary, a user send can be queued by software to avoid a QP/CQ overrun. The completion can simply be eaten by software. On the passive side, you have a similar process for receiving the data. This is similar to an option in the NetEffect driver. A zero byte RDMA write is sent from the active side and accounted for on the passive side. This can
Re: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU
Michael Krause wrote: At 01:17 PM 10/23/2007, Steve Wise wrote: Sean Hefty wrote: There has been much discussion on a private thread regarding bug #735 - dapltest performance tests don't adhere to iWARP standard that needs to move to the general list. This bug would be better titled iWarp cannot support uDAPL API. :) Seriously, the iWarp and uDAPL specs conflict. One needs to change. Can someone come up with a solution, possibly in iWARP CM, that will work and ensure interoperability between iWARP devices? I thought the restriction was there to support switching between streaming and rdma mode. If a connection only uses rdma mode, is the restriction really needed at all? Yes, because all iWARP connections start out as TCP streaming mode connections, and the MPA startup messages are sent in streaming mode. Then the connection is transitioned into FPDU (Framed PDU) mode using the MPA protocol. Correct. The IETF was very clear on these requirements (significant debate occurred over at least 12-18 months) and there is unlikely to be any traction in changing the iWARP specifications to provide another mechanism. Best to provide APIs that detect which semantics are required; if the application cannot adjust, it cannot use the iWARP semantics. First let me apologize in advance, but that is simply not a workable solution for the customer. I'm not taking anything away from the efforts of those involved with the definition of the MPA protocol; however, that protracted debate unfortunately occurred 2-3 years in advance of a deployed solution. The duration of the debate doesn't overcome the absence of practical perspective. There are now multiple implementations, the customers of which are complaining about the cost of the compromises made. We now have the benefit of hindsight and, in my opinion, should rev the MPA protocol. After all, that's why the number is there in the header -- right?
It may be that those involved with the original debate have no interest in revisiting it, but IMO that is irrelevant. There are now new companies involved that implemented RDDP, have customers using it, and have a sustaining (both interpretations intended) interest in making RDDP better. I, for one, would encourage them to do so. Protocols are not immutable, unless they're dead. BTW, if one uses the SDP port mapper protocol (see the IETF SDP specification), one can detect from the start that RDMA is being used and one could start in RDMA mode sans the MPA requirement. The SDP port mapper protocol also enables one to apply various other policies such as determining whether the application / remote node session should be allowed to run over RDMA or not - simple point of control for management. Really? What about CRC, Markers and Private Data? Mike
Re: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU
Felix Marti wrote: -Original Message- From: [EMAIL PROTECTED] [mailto:general- [EMAIL PROTECTED] On Behalf Of Kanevsky, Arkady Sent: Tuesday, October 23, 2007 6:26 PM To: Glenn Grundstrom; Sean Hefty; Steve Wise Cc: Roland Dreier; [EMAIL PROTECTED]; OpenFabrics General Subject: RE: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU This is still a protocol and should be defined by IETF not OFA. But if we get agreement from all iWARP vendors this will be a good step. [felix] This will not work with a Chelsio RNIC which follows the IETF specification by a) not issuing a 0B RDMA Write to the wire and b) silently consuming an incoming 0B write. Therefore 0B RDMA Writes cannot be 'abused' for such a synchronization mechanism. I believe that the mentioned apps adhering to the iWarp requirement do a 'send' from the active side and only have the passive side issue RDMA ops once the incoming send has been received. I would guess that following a similar model is the best way to go and supported by all iWarp vendors implementing the IETF spec. IMO, the iWARP vendors _must_ get together and work on MPA '2'. Standardizing FPDU 'abuse' might be a good place to start, but it needs to be fixed to support peer-to-peer going forward. In the mean-time, imperfectly hiding the issue in the Firmware, uDAPL, the iWARP CM or anywhere else except the application seems to me to be the only customer friendly solution. If we can not get agreement on it on reflector lets do it at SC'07 OFA dev. conference. Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. 
- Suite 16.Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -Original Message- From: Glenn Grundstrom [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 23, 2007 9:02 PM To: Sean Hefty; Steve Wise Cc: Roland Dreier; [EMAIL PROTECTED]; OpenFabrics General Subject: RE: [ofa-general] [RFP] support for iWARP requirement - activeconnect side MUST send first FPDU That is what I've been trying to push. Both MVAPICH2 and OMPI have been open to adjusting their transports to adhere to this requirement. I wouldn't mind implementing something to enforce this in the IWCM or the iWARP drivers IF there was a clean way to do it. So far there hasn't been a clean way proposed. Why can't either uDAPL or iW CM always do a send from the active to passive side that gets stripped off? From the active side, the first send is always posted before any user sends, and if necessary, a user send can be queued by software to avoid a QP/CQ overrun. The completion can simply be eaten by software. On the passive side, you have a similar process for receiving the data. This is similar to an option in the NetEffect driver. A zero byte RDMA write is sent from the active side and accounted for on the passive side. This can be turned on and off by compile and module options for compatibility. I second Sean's question - why can't uDAPL or the iw_cm do this? (Yes this adds wire protocol, which requires both sides to support it.) 
- Sean
Re: [ofa-general] iSER data corruption issues
On Thu, 2007-10-04 at 12:14 -0400, Pete Wyckoff wrote: [EMAIL PROTECTED] wrote on Wed, 03 Oct 2007 15:01 -0700: Machines are opteron, fedora 7 up-to-date with its openfab libs, kernel 2.6.23-rc6 on target. Either 2.6.23-rc6 or 2.6.22 or 2.6.18-rhel5 on initiator. For some reason, it is much easier to produce with the rhel5 kernel. There was a bug in mthca that caused data corruption with FMRs on Sinai (1-port PCIe) HCAs. It was fixed in commit 608d8268 (IB/mthca: Fix data corruption after FMR unmap on Sinai) which went in shortly before 2.6.21 was released. I don't know if the RHEL5 2.6.18 kernel has this fix or not -- but if you still see the problem on 2.6.22 and later kernels then this isn't the fix anyway. This is definitely it. Same test setup runs for an hour with this patch, but fails in tens of seconds without it. Thanks for pointing it out. This rhel5 kernel is 2.6.18-8.1.6. Perhaps there are newer ones about that have this critical patch included. I'm going to add a Big Fat Warning on the iser distribution about pre-2.6.21 kernels. It also crashes if the iSER connection drops in a certain easy-to-reproduce way, another reason to avoid it. Regarding the larger test I talked about that fails even on modern kernels, I'm still not able to reproduce that on my setup. I ran it literally all night with a hacked target that calculated the return buffer rather than accessing the disk. For now I'm calling that a separate bug and will investigate it further. Thanks to Tom and Tom for helping debug this. Thanks to Roland who actually knew what it was ... ;-) -- Pete
Re: [ofa-general] iSER data corruption issues
On Wed, 2007-10-03 at 13:42 -0400, Pete Wyckoff wrote: How does the requester (in IB speak) know that an RDMA Write operation has completed on the responder? We have a software iSER target, available at git.osc.edu/tgt or browse at http://git.osc.edu/?p=tgt.git . Using the existing in-kernel iSER initiator code, very rarely data corruption occurs, in that the received data from SCSI read operations does not match what was expected. Sometimes it appears as if random kernel memory has been scribbled on by an errant RDMA write from the target. My current working theory is that the RDMA write has not completed by the time the initiator looks at its incoming data buffer. Single RC QP, single CQ, no SRQ. Only Send, Receive, and RDMA Write work requests are used. After everything is connected up, a SCSI read sequence looks like: initiator: register pages with FMR, write test pattern initiator: Send request to target target: Recv request target: RDMA Write response to initiator target: Wait for CQ entry for local RDMA Write completion Pete: I don't think this should be necessary... target: Send response to initiator ...as long as the send is posted on the same SQ as the write. initiator: Recv response, access buffer On very rare occasions, this buffer will have the test pattern, not the data that the target just sent. Machines are opteron, fedora 7 up-to-date with its openfab libs, kernel 2.6.23-rc6 on target. Either 2.6.23-rc6 or 2.6.22 or 2.6.18-rhel5 on initiator. For some reason, it is much easier to produce with the rhel5 kernel. One site with fast disks can see similar corruption with 2.6.23-rc6, however. Target is pure userspace. Initiator is in kernel and is poked by lmdd (like normal dd) through an iSCSI block device (/dev/sdb). The IB spec seems to indicate that the contents of the RDMA Write buffer should be stable after completion of a subsequent send message (o9-20). In fact, the Wait for CQ entry step on the target should be unnecessary, no? I think so too.
Could there be some caching issues that the initiator is missing? I've added print[fk]s to the initiator and target to verify that the sequence of events is truly as above, and that the virtual addresses are as expected on both sides. Any suggestions or advice would help. Thanks, If your theory is correct, the data should eventually show up. Does it? Does your code check for errors on dma_map_single/page? -- Pete P.S. Here are some debugging printfs not in the git. Userspace code does 200 read()s of length 8000, but complains about the result somewhere in the 14th read, from 112000 to 120000, and exits early. Expected pattern is a series of 400,000 4-byte words, incrementing by 4, starting from 0. So 0x00000000, 0x00000004, ..., 0x001869fc: % lmdd of=internal ipat=1 if=/dev/sdb bs=8000 count=200 mismatch=10 off=112000 want=1c000 got=3b3b3b3b Initiator generates a series of SCSI operations, as driven by readahead and the block queue scheduler. You can see that it starts reading 4 pages, then 1 page, then 23 pages, then 1 page and so on, in order. These sizes and offsets vary from run to run. Each line here is printed after the SCSI read response has been received. It prints the first word in the buffer, and you can see the test pattern where data should be: tag 02 va 36061000 len 4000 word0 00000000 ref 1 tag 03 va 36065000 len 1000 word0 4000 ref 1 tag 04 va 36066000 len 17000 word0 5000 ref 1 tag 05 va 7b6bc000 len 1000 word0 3b3b3b3b ref 1 Is it interesting that the bad word occurs on the first page of the new map? tag 06 va 7b6bd000 len 1f000 word0 0001d000 ref 1 tag 07 va 7bdc2000 len 2 word0 0003c000 ref 1 The userspace target code prints a line when it starts the RDMA write, then a line when the RDMA write completes locally, then a line when it sends the response. The tags are what the initiator assigned to each request.
The target thinks it is sending a 4096-byte buffer that has 0x1c000 as its first word, but the initiator did not see it:

    tag 02 va 36061000 len 4000 word0 00000000 rdmaw
    tag 02 rdmaw completion
    tag 02 resp
    tag 03 va 36065000 len 1000 word0 4000 rdmaw
    tag 03 rdmaw completion
    tag 03 resp
    tag 04 va 36066000 len 17000 word0 5000 rdmaw
    tag 04 rdmaw completion
    tag 04 resp
    tag 05 va 7b6bc000 len 1000 word0 0001c000 rdmaw
    tag 05 rdmaw completion
    tag 05 resp
    tag 06 va 7b6bd000 len 1f000 word0 0001d000 rdmaw
    tag 06 rdmaw completion
    tag 07 va 7bdc2000 len 2 word0 0003c000 rdmaw
    tag 07 rdmaw completion
    tag 06 resp
    tag 07 resp

___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device
On Wed, 2007-09-26 at 14:06 -0500, Jim Mott wrote:

This is a two-part bug report. One is a conceptual problem that may just be a problem of understanding on my part. The other is what I believe to be a bug in the mlx4 driver.

mthca has the same issue.

1) ib_create_qp() fails with max_sge

If you use ib_query_device() to return the device-specific attribute max_sge, it seems reasonable to expect you can create a QP with max_send_sge=max_sge. The problem is that this often fails. The reason is that depending on the QP type (RC, UD, etc.) and how the QP will be used (send, RDMA, atomic, etc.), there can be extra segments required in the WQE that eat up SGE entries. So while some send WQEs might have max_sge available SGEs, many will not. Normally the difference between max_sge and the actual maximum value allowed (and checked) for max_send_sge is 1 or 2.

This issue may need API extensions to definitively resolve. In the short term, it would be very nice if the max_sge reported by ib_query_device() could always be a value that ib_create_qp() could use. Think of it as the minimum max_send_sge value that will work for all QP types.

2) mlx4 setting of max send SGEs

The recent patch to support shrinking WQEs introduces a big difference between the send SGEs mlx4 actually supports (checked against 61; should be 59 or 60) and what is reported by ib_query_device (32, to equal the receive-side max_rq_sg value). The patch that follows will allow mlx4 to support the number of send SGEs returned by ib_query_device, and in fact quite a few more. It probably breaks shrinking WQEs and thus should not be applied directly. Note that if ib_query_device() returned max_sge adjusted for the raddr and atomic segments, this fix would not be needed. mlx4 would still support more SGEs in hardware than can be used through the API, but that is a different problem.
--- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/qp.c	2007-09-26 13:27:47.000000000 -0500
+++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c	2007-09-26 13:36:40.000000000 -0500
@@ -370,7 +370,7 @@ static int set_kernel_sq_size(struct mlx
 	qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s));
 
 	for (;;) {
-		if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz)
+		if (s > dev->dev->caps.max_sq_desc_sz)
 			return -EINVAL;
 
 		qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift);
RE: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device
FWIW, I have code in my apps that retries QP creation with reduced values when the allocation with max fails.

There was also an earlier e-mail thread on this exact same issue, but the solution bantered about was to use special values in the qp_attr structure, a la QP_MAX_SEND_SGE (-1?). The provider would recognize this value and allocate the max for that attribute that would succeed given the current resource situation. The qp_attr structure would then be updated by the provider with the values given. This approach extends, but doesn't break, the API, allows existing apps to work as usual, and avoids the retry logic that I've added to my apps.

Just a thought,
Tom

On Wed, 2007-09-26 at 20:41 -0500, Jim Mott wrote:

The same bug exists with mthca. I saw it originally in the kernel doing RDS work, but I just put together a short user-space test. ibv_query_device(MT25204) returns max_sge=30:

- ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge fails
- ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge-1 works

I only have the two types of adapters to test with.

-----Original Message-----
From: Roland Dreier [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 26, 2007 5:32 PM
To: Jim Mott
Cc: general@lists.openfabrics.org
Subject: Re: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device

A minimal API change that could help would be to add two new fields to the ib_device_attr structure returned by ib_query_device:

- delta_sge_sg
- delta_sge_rd

Hmm, a cute idea, but I'm still left wondering if it's worth the ABI breakage etc. just to give a few more S/G entries in some situations.

The behavior would be that in all cases using max_sge for send or receive SGE count in create_qp would always succeed.

That means the current value the drivers return there would have to be reduced to fix this bug. All existing codes would continue to run.
Actually, are there any drivers other than patched mlx4 where max_sge doesn't always work? I agree we do want to get this right, but I thought we had fixed all such bugs. (And we should make sure that any shrinking-WQE patch for mlx4 doesn't introduce new bugs.)

(BTW, I see a different bug in unpatched mlx4, namely that it might report a too-big number of S/G entries allowed for the SQ.)

It looks like there is some movement in this direction already with the fields:

- max_sge_rd (nes, amso1100, ehca, cxgb3 only)

This field is obsolete, since we don't handle RD and almost certainly never will. I'm not sure why anyone is setting a value.

- max_srq_sge (amso1100, mthca, mlx4, ehca, ipath only)

Any devices that handle SRQ should set this. I think cxgb3 does not support SRQ.

 - R.
RE: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.
On Mon, 2007-09-24 at 16:30 -0500, Glenn Grundstrom wrote:

-----Original Message-----
From: Roland Dreier [mailto:[EMAIL PROTECTED]
Sent: Monday, September 24, 2007 2:33 PM
To: Glenn Grundstrom
Cc: Steve Wise; [EMAIL PROTECTED]; general@lists.openfabrics.org
Subject: Re: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.

I'm sure I had seen a previous email in this thread that suggested using a userspace library to open a socket in the shared port space. It seems that suggestion was dropped without reason. Does anyone know why?

Yes, because it doesn't handle in-kernel uses (e.g. NFS/RDMA, iSER, etc).

The kernel apps could open a Linux TCP socket and create an RDMA socket connection. Both calls are standard Linux kernel architected routines.

This approach was NAK'd by David Miller and others...

Doesn't NFSoRDMA already open a TCP socket and another for RDMA traffic (ports 2049 and 2050, if I remember correctly)?

The NFS RDMA transport driver does not open a socket for the RDMA connection. It uses a different port in order to allow both TCP and RDMA mounts to the same filer.

I currently don't know if iSER, RDS, etc. already do the same thing, but if they don't, they probably could very easily.

Woe be to those who do so...

Does the NetEffect NIC have the same issue as cxgb3 here? What are your thoughts on how to handle this?

Yes, the NetEffect RNIC will have the same issue as Chelsio, and all future RNICs which support a unified TCP address with Linux will as well. Steve has put a lot of thought and energy into the problem, but I don't think users and admins will be very happy with us in the long run.

Agreed.

In summary, short of having the rdma_cm share kernel port space, I'd like to see the equivalent in userspace and have the kernel apps handle the issue in a similar way as described above.
There are a few technical issues to work through (like passing the userspace IP address to the kernel),

This just moves the socket creation to code that is outside the purview of the kernel maintainers. The exchange of the 4-tuple created with the kernel module, however, is back in the kernel and in the maintainers' control and responsibility. In my view, anything like this will be viewed as an attempt to sneak code into the kernel that the maintainer has already vehemently rejected. This will make people angry and damage the cooperative working relationship that we are trying to build.

but I think we can solve that just like other information that gets passed from user into the IB/RDMA kernel modules.

Sharing the IP 4-tuple space cooperatively with the core in any fashion has been NAK'd. Without this cooperation, the options we've been able to come up with are administrative/policy based approaches. Any ideas you have along these lines are welcome.

Tom

Glenn.

 - R.
[ofa-general] [RFC,PATCH 00/20] svc: Server Side Transport Switch
This patchset modifies the RPC server-side implementation to support pluggable transports. This was done in order to allow RPC applications (NFS) to run over RDMA transports like IB and iWARP. This patchset was also published to [EMAIL PROTECTED]

This patchset represents an update to the previously published version. The most significant changes are a renaming of the transport switch data structures and functions based on a recommendation from Chuck Lever. Code cleanup was also done in the portlist implementation based on feedback from Trond. I've included the original description below for new reviewers.

This patchset implements a sunrpc server-side pluggable transport switch that supports dynamically registered transports. The knfsd daemon has been modified to allow user-mode programs to add a new listening endpoint by writing a string to the portlist file. The format of the string is as follows:

    <transport-name> <port>

For example,

    # echo rdma 2050 > /proc/fs/nfsd/portlist

will cause the knfsd daemon to attempt to add a listening endpoint on port 2050 using the 'rdma' transport.

Transports register themselves with the transport switch using a new API that has the following synopsis:

    void svc_register_transport(struct svc_sock_ops *xprt);

The text transport name is contained in a field in the xprt structure. A new service has been added as well to take a transport name instead of an IP protocol number to specify the transport on which the listening endpoint is to be created. This function is defined as follows:

    int svc_create_svcsock(struct svc_serv *serv, char *transport_name, unsigned short port, int flags);

The existing svc_makesock interface was left to avoid impacts to existing servers. It has been modified to map IP protocol numbers to transport strings.
--
Signed-off-by: Tom Tucker [EMAIL PROTECTED]
[ofa-general] [RFC, PATCH 01/20] svc: Add svc_xprt transport switch structure
Start moving to a transport switch for knfsd. Add a svc_xprt switch and move the sk_sendto and sk_recvfrom function pointers into it.

Signed-off-by: Greg Banks [EMAIL PROTECTED]
Signed-off-by: Peter Leckie [EMAIL PROTECTED]
Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---
 include/linux/sunrpc/svcsock.h |    9 +++++++--
 net/sunrpc/svcsock.c           |   22 ++++++++++++++------
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index e21dd93..4792ed6 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -11,6 +11,12 @@ #define SUNRPC_SVCSOCK_H
 #include <linux/sunrpc/svc.h>
 
+struct svc_xprt {
+	const char		*xpt_name;
+	int			(*xpt_recvfrom)(struct svc_rqst *rqstp);
+	int			(*xpt_sendto)(struct svc_rqst *rqstp);
+};
+
 /*
  * RPC server socket.
  */
@@ -43,8 +49,7 @@ #define SK_DETACHED	10	/* detached fro
 					 * be revisted */
 	struct mutex		sk_mutex;	/* to serialize sending data */
 
-	int			(*sk_recvfrom)(struct svc_rqst *rqstp);
-	int			(*sk_sendto)(struct svc_rqst *rqstp);
+	const struct svc_xprt	*sk_xprt;
 
 	/* We keep the old state_change and data_ready CB's here */
 	void			(*sk_ostate)(struct sock *);
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 5baf48d..789d94a 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -885,6 +885,12 @@ svc_udp_sendto(struct svc_rqst *rqstp)
 	return error;
 }
 
+static const struct svc_xprt svc_udp_xprt = {
+	.xpt_name = "udp",
+	.xpt_recvfrom = svc_udp_recvfrom,
+	.xpt_sendto = svc_udp_sendto,
+};
+
 static void
 svc_udp_init(struct svc_sock *svsk)
 {
@@ -893,8 +899,7 @@ svc_udp_init(struct svc_sock *svsk)
 	svsk->sk_sk->sk_data_ready = svc_udp_data_ready;
 	svsk->sk_sk->sk_write_space = svc_write_space;
-	svsk->sk_recvfrom = svc_udp_recvfrom;
-	svsk->sk_sendto = svc_udp_sendto;
+	svsk->sk_xprt = &svc_udp_xprt;
 
 	/* initialise setting must have enough space to
 	 * receive and respond to one request.
@@ -1322,14 +1327,19 @@ svc_tcp_sendto(struct svc_rqst *rqstp)
 	return sent;
 }
 
+static const struct svc_xprt svc_tcp_xprt = {
+	.xpt_name = "tcp",
+	.xpt_recvfrom = svc_tcp_recvfrom,
+	.xpt_sendto = svc_tcp_sendto,
+};
+
 static void
 svc_tcp_init(struct svc_sock *svsk)
 {
 	struct sock	*sk = svsk->sk_sk;
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	svsk->sk_recvfrom = svc_tcp_recvfrom;
-	svsk->sk_sendto = svc_tcp_sendto;
+	svsk->sk_xprt = &svc_tcp_xprt;
 
 	if (sk->sk_state == TCP_LISTEN) {
 		dprintk("setting up TCP socket for listening\n");
@@ -1477,7 +1487,7 @@ svc_recv(struct svc_rqst *rqstp, long ti
 	dprintk("svc: server %p, pool %u, socket %p, inuse=%d\n",
 		rqstp, pool->sp_id, svsk, atomic_read(&svsk->sk_inuse));
-	len = svsk->sk_recvfrom(rqstp);
+	len = svsk->sk_xprt->xpt_recvfrom(rqstp);
 	dprintk("svc: got len=%d\n", len);
 
 	/* No data, incomplete (TCP) read, or accept() */
@@ -1537,7 +1547,7 @@ svc_send(struct svc_rqst *rqstp)
 	if (test_bit(SK_DEAD, &svsk->sk_flags))
 		len = -ENOTCONN;
 	else
-		len = svsk->sk_sendto(rqstp);
+		len = svsk->sk_xprt->xpt_sendto(rqstp);
 	mutex_unlock(&svsk->sk_mutex);
 	svc_sock_release(rqstp);
[ofa-general] [RFC,PATCH 02/20] svc: xpt_detach and xpt_free
Add transport switch functions to ensure that no additional receive ready events will be delivered by the transport (xpt_detach), and another to free memory associated with the transport (xpt_free). Change svc_delete_socket() and svc_sock_put() to use the new transport functions.

Signed-off-by: Greg Banks [EMAIL PROTECTED]
Signed-off-by: Peter Leckie [EMAIL PROTECTED]
Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---
 include/linux/sunrpc/svcsock.h |   12 ++++++++++
 net/sunrpc/svcsock.c           |   50 +++++++++++++++++++++++++++++++++------
 2 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index 4792ed6..27c5b1f 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -15,6 +15,18 @@ struct svc_xprt {
 	const char		*xpt_name;
 	int			(*xpt_recvfrom)(struct svc_rqst *rqstp);
 	int			(*xpt_sendto)(struct svc_rqst *rqstp);
+	/*
+	 * Detach the svc_sock from its socket, so that the
+	 * svc_sock will not be enqueued any more. This is
+	 * the first stage in the destruction of a svc_sock.
+	 */
+	void			(*xpt_detach)(struct svc_sock *);
+	/*
+	 * Release all network-level resources held by the svc_sock,
+	 * and the svc_sock itself. This is the final stage in the
+	 * destruction of a svc_sock.
+	 */
+	void			(*xpt_free)(struct svc_sock *);
 };
 
 /*
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 789d94a..4956c88 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -84,6 +84,8 @@ static void svc_udp_data_ready(struct s
 static int		svc_udp_recvfrom(struct svc_rqst *);
 static int		svc_udp_sendto(struct svc_rqst *);
 static void		svc_close_socket(struct svc_sock *svsk);
+static void		svc_sock_detach(struct svc_sock *);
+static void		svc_sock_free(struct svc_sock *);
 
 static struct svc_deferred_req *svc_deferred_dequeue(struct svc_sock *svsk);
 static int svc_deferred_recv(struct svc_rqst *rqstp);
@@ -378,14 +380,9 @@ svc_sock_put(struct svc_sock *svsk)
 	if (atomic_dec_and_test(&svsk->sk_inuse)) {
 		BUG_ON(! test_bit(SK_DEAD, &svsk->sk_flags));
-		dprintk("svc: releasing dead socket\n");
-		if (svsk->sk_sock->file)
-			sockfd_put(svsk->sk_sock);
-		else
-			sock_release(svsk->sk_sock);
 		if (svsk->sk_info_authunix != NULL)
 			svcauth_unix_info_release(svsk->sk_info_authunix);
-		kfree(svsk);
+		svsk->sk_xprt->xpt_free(svsk);
 	}
 }
 
@@ -889,6 +886,8 @@ static const struct svc_xprt svc_udp_xpr
 	.xpt_name = "udp",
 	.xpt_recvfrom = svc_udp_recvfrom,
 	.xpt_sendto = svc_udp_sendto,
+	.xpt_detach = svc_sock_detach,
+	.xpt_free = svc_sock_free,
 };
 
 static void
@@ -1331,6 +1330,8 @@ static const struct svc_xprt svc_tcp_xpr
 	.xpt_name = "tcp",
 	.xpt_recvfrom = svc_tcp_recvfrom,
 	.xpt_sendto = svc_tcp_sendto,
+	.xpt_detach = svc_sock_detach,
+	.xpt_free = svc_sock_free,
 };
 
 static void
@@ -1770,6 +1771,38 @@ bummer:
 }
 
 /*
+ * Detach the svc_sock from the socket so that no
+ * more callbacks occur.
+ */
+static void
+svc_sock_detach(struct svc_sock *svsk)
+{
+	struct sock *sk = svsk->sk_sk;
+
+	dprintk("svc: svc_sock_detach(%p)\n", svsk);
+
+	/* put back the old socket callbacks */
+	sk->sk_state_change = svsk->sk_ostate;
+	sk->sk_data_ready = svsk->sk_odata;
+	sk->sk_write_space = svsk->sk_owspace;
+}
+
+/*
+ * Free the svc_sock's socket resources and the svc_sock itself.
+ */
+static void
+svc_sock_free(struct svc_sock *svsk)
+{
+	dprintk("svc: svc_sock_free(%p)\n", svsk);
+
+	if (svsk->sk_sock->file)
+		sockfd_put(svsk->sk_sock);
+	else
+		sock_release(svsk->sk_sock);
+	kfree(svsk);
+}
+
+/*
  * Remove a dead socket
  */
 static void
@@ -1783,9 +1816,8 @@ svc_delete_socket(struct svc_sock *svsk)
 	serv = svsk->sk_server;
 	sk = svsk->sk_sk;
 
-	sk->sk_state_change = svsk->sk_ostate;
-	sk->sk_data_ready = svsk->sk_odata;
-	sk->sk_write_space = svsk->sk_owspace;
+	if (svsk->sk_xprt->xpt_detach)
+		svsk->sk_xprt->xpt_detach(svsk);
 
 	spin_lock_bh(&serv->sv_lock);
[ofa-general] [RFC,PATCH 03/20] svc: xpt_prep_reply_hdr
Add a transport function that prepares the transport-specific header for RPC replies. UDP has none; TCP has a 4B record length. This will allow the RDMA transport to prepare its variable-length reply header as well.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---
 include/linux/sunrpc/svcsock.h |    4 ++++
 net/sunrpc/svc.c               |    8 +++++---
 net/sunrpc/svcsock.c           |   15 +++++++++++++++
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index 27c5b1f..1da42c2 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -27,6 +27,10 @@ struct svc_xprt {
 	 * destruction of a svc_sock.
 	 */
 	void			(*xpt_free)(struct svc_sock *);
+	/*
+	 * Prepare any transport-specific RPC header.
+	 */
+	int			(*xpt_prep_reply_hdr)(struct svc_rqst *);
 };
 
 /*
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index e673ef9..72a900f 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -815,9 +815,11 @@ svc_process(struct svc_rqst *rqstp)
 	rqstp->rq_res.tail[0].iov_len = 0;
 	/* Will be turned off only in gss privacy case: */
 	rqstp->rq_sendfile_ok = 1;
-	/* tcp needs a space for the record length... */
-	if (rqstp->rq_prot == IPPROTO_TCP)
-		svc_putnl(resv, 0);
+
+	/* setup response header. */
+	if (rqstp->rq_sock->sk_xprt->xpt_prep_reply_hdr &&
+	    rqstp->rq_sock->sk_xprt->xpt_prep_reply_hdr(rqstp))
+		goto dropit;
 
 	rqstp->rq_xid = svc_getu32(argv);
 	svc_putu32(resv, rqstp->rq_xid);
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 4956c88..ca473ee 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1326,12 +1326,27 @@ svc_tcp_sendto(struct svc_rqst *rqstp)
 	return sent;
 }
 
+/*
+ * Setup response header. TCP has a 4B record length field.
+ */
+static int
+svc_tcp_prep_reply_hdr(struct svc_rqst *rqstp)
+{
+	struct kvec *resv = &rqstp->rq_res.head[0];
+
+	/* tcp needs a space for the record length... */
+	svc_putnl(resv, 0);
+
+	return 0;
+}
+
 static const struct svc_xprt svc_tcp_xprt = {
 	.xpt_name = "tcp",
 	.xpt_recvfrom = svc_tcp_recvfrom,
 	.xpt_sendto = svc_tcp_sendto,
 	.xpt_detach = svc_sock_detach,
 	.xpt_free = svc_sock_free,
+	.xpt_prep_reply_hdr = svc_tcp_prep_reply_hdr,
 };
 
 static void
[ofa-general] [RFC,PATCH 04/20] svc: xpt_has_wspace
Move the code that checks for available write space on the socket into a new transport function. This will allow transports flexibility when determining if enough space/memory is available to process the reply. The role of this function for RDMA is to avoid stalling a knfsd thread when SQ space is not available.

Signed-off-by: Greg Banks [EMAIL PROTECTED]
Signed-off-by: Peter Leckie [EMAIL PROTECTED]
Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---
 include/linux/sunrpc/svcsock.h |    4 ++
 net/sunrpc/svcsock.c           |   75 +++++++++++++++++++++++++--------------
 2 files changed, 52 insertions(+), 27 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index 1da42c2..3faa95c 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -31,6 +31,10 @@ struct svc_xprt {
 	 * Prepare any transport-specific RPC header.
 	 */
 	int			(*xpt_prep_reply_hdr)(struct svc_rqst *);
+	/*
+	 * Return 1 if sufficient space to write reply to network.
+	 */
+	int			(*xpt_has_wspace)(struct svc_sock *);
 };
 
 /*
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index ca473ee..b16dad4 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -205,22 +205,6 @@ svc_release_skb(struct svc_rqst *rqstp)
 }
 
 /*
- * Any space to write?
- */
-static inline unsigned long
-svc_sock_wspace(struct svc_sock *svsk)
-{
-	int wspace;
-
-	if (svsk->sk_sock->type == SOCK_STREAM)
-		wspace = sk_stream_wspace(svsk->sk_sk);
-	else
-		wspace = sock_wspace(svsk->sk_sk);
-
-	return wspace;
-}
-
-/*
  * Queue up a socket with data pending. If there are idle nfsd
  * processes, wake 'em up.
  *
@@ -269,21 +253,13 @@ svc_sock_enqueue(struct svc_sock *svsk)
 	BUG_ON(svsk->sk_pool != NULL);
 	svsk->sk_pool = pool;
 
-	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
-	if (((atomic_read(&svsk->sk_reserved) + serv->sv_max_mesg)*2
-	     > svc_sock_wspace(svsk))
-	    && !test_bit(SK_CLOSE, &svsk->sk_flags)
-	    && !test_bit(SK_CONN, &svsk->sk_flags)) {
-		/* Don't enqueue while not enough space for reply */
-		dprintk("svc: socket %p no space, %d*2 > %ld, not enqueued\n",
-			svsk->sk_sk, atomic_read(&svsk->sk_reserved)+serv->sv_max_mesg,
-			svc_sock_wspace(svsk));
+	if (!test_bit(SK_CLOSE, &svsk->sk_flags)
+	    && !test_bit(SK_CONN, &svsk->sk_flags)
+	    && !svsk->sk_xprt->xpt_has_wspace(svsk)) {
 		svsk->sk_pool = NULL;
 		clear_bit(SK_BUSY, &svsk->sk_flags);
 		goto out_unlock;
 	}
-	clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
 
 	if (!list_empty(&pool->sp_threads)) {
 		rqstp = list_entry(pool->sp_threads.next,
@@ -882,12 +858,45 @@ svc_udp_sendto(struct svc_rqst *rqstp)
 	return error;
 }
 
+/**
+ * svc_sock_has_write_space - Checks if there is enough space
+ * to send the reply on the socket.
+ * @svsk: the svc_sock to write on
+ * @wspace: the number of bytes available for writing
+ */
+static int svc_sock_has_write_space(struct svc_sock *svsk, int wspace)
+{
+	struct svc_serv *serv = svsk->sk_server;
+	int required = atomic_read(&svsk->sk_reserved) + serv->sv_max_mesg;
+
+	if (required*2 > wspace) {
+		/* Don't enqueue while not enough space for reply */
+		dprintk("svc: socket %p no space, %d*2 > %d, not enqueued\n",
+			svsk->sk_sk, required, wspace);
+		return 0;
+	}
+	clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
+	return 1;
+}
+
+static int
+svc_udp_has_wspace(struct svc_sock *svsk)
+{
+	/*
+	 * Set the SOCK_NOSPACE flag before checking the available
+	 * sock space.
+	 */
+	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
+	return svc_sock_has_write_space(svsk, sock_wspace(svsk->sk_sk));
+}
+
 static const struct svc_xprt svc_udp_xprt = {
 	.xpt_name = "udp",
 	.xpt_recvfrom = svc_udp_recvfrom,
 	.xpt_sendto = svc_udp_sendto,
 	.xpt_detach = svc_sock_detach,
 	.xpt_free = svc_sock_free,
+	.xpt_has_wspace = svc_udp_has_wspace,
 };
 
 static void
@@ -1340,6 +1349,17 @@ svc_tcp_prep_reply_hdr(struct svc_rqst *
 	return 0;
 }
 
+static int
+svc_tcp_has_wspace(struct svc_sock *svsk)
+{
+	/*
+	 * Set the SOCK_NOSPACE flag before checking the available
+	 * sock space.
+	 */
+	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
+	return svc_sock_has_write_space(svsk, sk_stream_wspace(svsk->sk_sk));
+}
+
 static const struct svc_xprt svc_tcp_xprt = {
 	.xpt_name = "tcp",
 	.xpt_recvfrom = svc_tcp_recvfrom,
 	.xpt_sendto = svc_tcp_sendto,
@@ -1347,6 +1367,7 @@ static const struct svc_xprt svc_tcp_xpr
 	.xpt_detach = svc_sock_detach,
 	.xpt_free = svc_sock_free,
 	.xpt_prep_reply_hdr = svc_tcp_prep_reply_hdr,
[ofa-general] [RFC, PATCH 06/20] svc: export svc_sock_enqueue, svc_sock_received
Export svc_sock_enqueue() and svc_sock_received() so they can be used by sunrpc server transport implementations (even future modular ones).

Signed-off-by: Greg Banks [EMAIL PROTECTED]
Signed-off-by: Peter Leckie [EMAIL PROTECTED]
Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---
 include/linux/sunrpc/svcsock.h |    2 ++
 net/sunrpc/svcsock.c           |    7 ++++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index 4e24e6d..0145057 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -108,6 +108,8 @@ int		svc_addsock(struct svc_serv *serv,
 					    int fd,
 					    char *name_return,
 					    int *proto);
+void		svc_sock_enqueue(struct svc_sock *svsk);
+void		svc_sock_received(struct svc_sock *svsk);
 
 /*
  * svc_makesock socket characteristics
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 0dc94a8..8fad53d 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -209,7 +209,7 @@ svc_release_skb(struct svc_rqst *rqstp)
  * processes, wake 'em up.
  *
  */
-static void
+void
 svc_sock_enqueue(struct svc_sock *svsk)
 {
 	struct svc_serv	*serv = svsk->sk_server;
@@ -287,6 +287,7 @@ svc_sock_enqueue(struct svc_sock *svsk)
 out_unlock:
 	spin_unlock_bh(&pool->sp_lock);
 }
+EXPORT_SYMBOL_GPL(svc_sock_enqueue);
 
 /*
  * Dequeue the first socket. Must be called with the pool->sp_lock held.
@@ -315,14 +316,14 @@ svc_sock_dequeue(struct svc_pool *pool)
  * Note: SK_DATA only gets cleared when a read-attempt finds
  * no (or insufficient) data.
  */
-static inline void
+void
 svc_sock_received(struct svc_sock *svsk)
 {
 	svsk->sk_pool = NULL;
 	clear_bit(SK_BUSY, &svsk->sk_flags);
 	svc_sock_enqueue(svsk);
 }
-
+EXPORT_SYMBOL_GPL(svc_sock_received);
 
 /**
  * svc_reserve - change the space reserved for the reply to a request.
[ofa-general] [RFC,PATCH 10/20] svc: Add generic refcount services
Add inline svc_sock_get() so that service transport code will not need to manipulate sk_inuse directly. Also, make svc_sock_put() available so that transport code outside svcsock.c can use it.

Signed-off-by: Greg Banks [EMAIL PROTECTED]
Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---
 include/linux/sunrpc/svcsock.h |   15 +++++++++++++++
 net/sunrpc/svcsock.c           |   29 ++++++++++++++---------------
 2 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index ea8b62b..9f37f30 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -115,6 +115,7 @@ int		svc_addsock(struct svc_serv *serv,
 					    int *proto);
 void		svc_sock_enqueue(struct svc_sock *svsk);
 void		svc_sock_received(struct svc_sock *svsk);
+void		__svc_sock_put(struct svc_sock *svsk);
 
 /*
  * svc_makesock socket characteristics
@@ -123,4 +124,18 @@ #define SVC_SOCK_DEFAULTS	(0U)
 #define SVC_SOCK_ANONYMOUS	(1U << 0)	/* don't register with pmap */
 #define SVC_SOCK_TEMPORARY	(1U << 1)	/* flag socket as temporary */
 
+/*
+ * Take and drop a temporary reference count on the svc_sock.
+ */
+static inline void svc_sock_get(struct svc_sock *svsk)
+{
+	atomic_inc(&svsk->sk_inuse);
+}
+
+static inline void svc_sock_put(struct svc_sock *svsk)
+{
+	if (atomic_dec_and_test(&svsk->sk_inuse))
+		__svc_sock_put(svsk);
+}
+
 #endif /* SUNRPC_SVCSOCK_H */
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index dcb5c7a..02f682a 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -273,7 +273,7 @@ svc_sock_enqueue(struct svc_sock *svsk)
 			"svc_sock_enqueue: server %p, rq_sock=%p!\n",
 			rqstp, rqstp->rq_sock);
 		rqstp->rq_sock = svsk;
-		atomic_inc(&svsk->sk_inuse);
+		svc_sock_get(svsk);
 		rqstp->rq_reserved = serv->sv_max_mesg;
 		atomic_add(rqstp->rq_reserved, &svsk->sk_reserved);
 		BUG_ON(svsk->sk_pool != pool);
@@ -351,17 +351,16 @@ void svc_reserve(struct svc_rqst *rqstp,
 /*
  * Release a socket after use.
  */
-static inline void
-svc_sock_put(struct svc_sock *svsk)
+void
+__svc_sock_put(struct svc_sock *svsk)
 {
-	if (atomic_dec_and_test(&svsk->sk_inuse)) {
-		BUG_ON(! test_bit(SK_DEAD, &svsk->sk_flags));
+	BUG_ON(! test_bit(SK_DEAD, &svsk->sk_flags));
 
-		if (svsk->sk_info_authunix != NULL)
-			svcauth_unix_info_release(svsk->sk_info_authunix);
-		svsk->sk_xprt->xpt_free(svsk);
-	}
+	if (svsk->sk_info_authunix != NULL)
+		svcauth_unix_info_release(svsk->sk_info_authunix);
+	svsk->sk_xprt->xpt_free(svsk);
 }
+EXPORT_SYMBOL_GPL(__svc_sock_put);
 
 static void
 svc_sock_release(struct svc_rqst *rqstp)
@@ -1109,7 +1108,7 @@ svc_tcp_accept(struct svc_sock *svsk)
 					  struct svc_sock,
 					  sk_list);
 			set_bit(SK_CLOSE, &svsk->sk_flags);
-			atomic_inc(&svsk->sk_inuse);
+			svc_sock_get(svsk);
 		}
 		spin_unlock_bh(&serv->sv_lock);
@@ -1481,7 +1480,7 @@ svc_recv(struct svc_rqst *rqstp, long ti
 	spin_lock_bh(&pool->sp_lock);
 	if ((svsk = svc_sock_dequeue(pool)) != NULL) {
 		rqstp->rq_sock = svsk;
-		atomic_inc(&svsk->sk_inuse);
+		svc_sock_get(svsk);
 		rqstp->rq_reserved = serv->sv_max_mesg;
 		atomic_add(rqstp->rq_reserved, &svsk->sk_reserved);
 	} else {
@@ -1620,7 +1619,7 @@ svc_age_temp_sockets(unsigned long closu
 			continue;
 		if (atomic_read(&svsk->sk_inuse) || test_bit(SK_BUSY, &svsk->sk_flags))
 			continue;
-		atomic_inc(&svsk->sk_inuse);
+		svc_sock_get(svsk);
 		list_move(le, &to_be_aged);
 		set_bit(SK_CLOSE, &svsk->sk_flags);
 		set_bit(SK_DETACHED, &svsk->sk_flags);
@@ -1868,7 +1867,7 @@ svc_delete_socket(struct svc_sock *svsk)
 	 */
 	if (!test_and_set_bit(SK_DEAD, &svsk->sk_flags)) {
 		BUG_ON(atomic_read(&svsk->sk_inuse)<2);
-		atomic_dec(&svsk->sk_inuse);
+		svc_sock_put(svsk);
 		if (test_bit(SK_TEMP, &svsk->sk_flags))
 			serv->sv_tmpcnt--;
 	}
@@ -1883,7 +1882,7 @@ static void svc_close_socket(struct svc_
 		/* someone else will have to effect the close */
 		return;
 
-	atomic_inc(&svsk->sk_inuse);
+	svc_sock_get(svsk);
 	svc_delete_socket(svsk);
 	clear_bit(SK_BUSY, &svsk->sk_flags);
 	svc_sock_put(svsk);
@@ -1976,7 +1975,7 @@ svc_defer(struct cache_req *req)
 		dr->argslen = rqstp->rq_arg.len >> 2;
 		memcpy(dr
[ofa-general] [RFC,PATCH 13/20] svc: Add svc_[un]register_transport
Add an exported function for transport modules to [un]register themselves with the sunrpc server side transport switch. Signed-off-by: Tom Tucker [EMAIL PROTECTED] --- include/linux/sunrpc/svcsock.h |6 + net/sunrpc/svcsock.c | 50 2 files changed, 56 insertions(+), 0 deletions(-) diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h index 7def951..cc911ab 100644 --- a/include/linux/sunrpc/svcsock.h +++ b/include/linux/sunrpc/svcsock.h @@ -13,6 +13,7 @@ #include linux/sunrpc/svc.h struct svc_xprt { const char *xpt_name; + struct module *xpt_owner; int (*xpt_recvfrom)(struct svc_rqst *rqstp); int (*xpt_sendto)(struct svc_rqst *rqstp); /* @@ -45,7 +46,10 @@ struct svc_xprt { * Accept a pending connection, for connection-oriented transports */ int (*xpt_accept)(struct svc_sock *svsk); + /* Transport list link */ + struct list_headxpt_list; }; +extern struct list_head svc_transport_list; /* * RPC server socket. @@ -102,6 +106,8 @@ #define SK_LISTENER 11 /* listener (e. /* * Function prototypes. 
*/ +intsvc_register_transport(struct svc_xprt *xprt); +intsvc_unregister_transport(struct svc_xprt *xprt); intsvc_makesock(struct svc_serv *, int, unsigned short, int flags); void svc_force_close_socket(struct svc_sock *); intsvc_recv(struct svc_rqst *, long); diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c index 6acf22f..6183951 100644 --- a/net/sunrpc/svcsock.c +++ b/net/sunrpc/svcsock.c @@ -91,6 +91,54 @@ static struct svc_deferred_req *svc_defe static int svc_deferred_recv(struct svc_rqst *rqstp); static struct cache_deferred_req *svc_defer(struct cache_req *req); +/* List of registered transports */ +static spinlock_t svc_transport_lock = SPIN_LOCK_UNLOCKED; +LIST_HEAD(svc_transport_list); + +int svc_register_transport(struct svc_xprt *xprt) +{ + struct svc_xprt *ops; + int res; + + dprintk(svc: Adding svc transport '%s'\n, + xprt-xpt_name); + + res = -EEXIST; + INIT_LIST_HEAD(xprt-xpt_list); + spin_lock(svc_transport_lock); + list_for_each_entry(ops, svc_transport_list, xpt_list) { + if (xprt == ops) + goto out; + } + list_add_tail(xprt-xpt_list, svc_transport_list); + res = 0; +out: + spin_unlock(svc_transport_lock); + return res; +} +EXPORT_SYMBOL_GPL(svc_register_transport); + +int svc_unregister_transport(struct svc_xprt *xprt) +{ + struct svc_xprt *ops; + int res = 0; + + dprintk(svc: Removing svc transport '%s'\n, xprt-xpt_name); + + spin_lock(svc_transport_lock); + list_for_each_entry(ops, svc_transport_list, xpt_list) { + if (xprt == ops) { + list_del_init(ops-xpt_list); + goto out; + } + } + res = -ENOENT; + out: + spin_unlock(svc_transport_lock); + return res; +} +EXPORT_SYMBOL_GPL(svc_unregister_transport); + /* apparently the standard is that clients close * idle connections after 5 minutes, servers after * 6 minutes @@ -887,6 +935,7 @@ svc_udp_has_wspace(struct svc_sock *svsk static const struct svc_xprt svc_udp_xprt = { .xpt_name = udp, + .xpt_owner = THIS_MODULE, .xpt_recvfrom = svc_udp_recvfrom, .xpt_sendto = svc_udp_sendto, 
.xpt_detach = svc_sock_detach, @@ -1346,6 +1395,7 @@ svc_tcp_has_wspace(struct svc_sock *svsk static const struct svc_xprt svc_tcp_xprt = { .xpt_name = tcp, + .xpt_owner = THIS_MODULE, .xpt_recvfrom = svc_tcp_recvfrom, .xpt_sendto = svc_tcp_sendto, .xpt_detach = svc_sock_detach, ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [RFC,PATCH 14/20] svc: Register TCP/UDP Transports
Add a call to svc_register_transport for the built in transports UDP and TCP. The registration is done in the sunrpc module initialization logic. Signed-off-by: Tom Tucker [EMAIL PROTECTED] --- net/sunrpc/sunrpc_syms.c |2 ++ net/sunrpc/svcsock.c | 10 -- 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/net/sunrpc/sunrpc_syms.c b/net/sunrpc/sunrpc_syms.c index 73075de..c68577b 100644 --- a/net/sunrpc/sunrpc_syms.c +++ b/net/sunrpc/sunrpc_syms.c @@ -134,6 +134,7 @@ EXPORT_SYMBOL(nfsd_debug); EXPORT_SYMBOL(nlm_debug); #endif +extern void init_svc_xprt(void); extern struct cache_detail ip_map_cache, unix_gid_cache; static int __init @@ -156,6 +157,7 @@ #endif cache_register(ip_map_cache); cache_register(unix_gid_cache); init_socket_xprt(); + init_svc_xprt(); out: return err; } diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c index 6183951..d6443e8 100644 --- a/net/sunrpc/svcsock.c +++ b/net/sunrpc/svcsock.c @@ -933,7 +933,7 @@ svc_udp_has_wspace(struct svc_sock *svsk return svc_sock_has_write_space(svsk, sock_wspace(svsk-sk_sk)); } -static const struct svc_xprt svc_udp_xprt = { +static struct svc_xprt svc_udp_xprt = { .xpt_name = udp, .xpt_owner = THIS_MODULE, .xpt_recvfrom = svc_udp_recvfrom, @@ -1393,7 +1393,7 @@ svc_tcp_has_wspace(struct svc_sock *svsk return svc_sock_has_write_space(svsk, sk_stream_wspace(svsk-sk_sk)); } -static const struct svc_xprt svc_tcp_xprt = { +static struct svc_xprt svc_tcp_xprt = { .xpt_name = tcp, .xpt_owner = THIS_MODULE, .xpt_recvfrom = svc_tcp_recvfrom, @@ -1406,6 +1406,12 @@ static const struct svc_xprt svc_tcp_xpr .xpt_accept = svc_tcp_accept, }; +void init_svc_xprt(void) +{ + svc_register_transport(svc_udp_xprt); + svc_register_transport(svc_tcp_xprt); +} + static void svc_tcp_init_listener(struct svc_sock *svsk) {
[ofa-general] [RFC,PATCH 15/20] svc: transport file implementation
Create a proc/sys/sunrpc/transport file that contains information about the currently registered transports. Signed-off-by: Tom Tucker [EMAIL PROTECTED] --- include/linux/sunrpc/debug.h |1 + net/sunrpc/svcsock.c | 28 net/sunrpc/sysctl.c | 40 +++- 3 files changed, 68 insertions(+), 1 deletions(-) diff --git a/include/linux/sunrpc/debug.h b/include/linux/sunrpc/debug.h index 10709cb..89458df 100644 --- a/include/linux/sunrpc/debug.h +++ b/include/linux/sunrpc/debug.h @@ -88,6 +88,7 @@ enum { CTL_SLOTTABLE_TCP, CTL_MIN_RESVPORT, CTL_MAX_RESVPORT, + CTL_TRANSPORTS, }; #endif /* _LINUX_SUNRPC_DEBUG_H_ */ diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c index d6443e8..276737e 100644 --- a/net/sunrpc/svcsock.c +++ b/net/sunrpc/svcsock.c @@ -139,6 +139,34 @@ int svc_unregister_transport(struct svc_ } EXPORT_SYMBOL_GPL(svc_unregister_transport); +/* + * Format the transport list for printing + */ +int svc_print_transports(char *buf, int maxlen) +{ + struct list_head *le; + char tmpstr[80]; + int len = 0; + buf[0] = '\0'; + + spin_lock(svc_transport_lock); + list_for_each(le, svc_transport_list) { + int slen; + struct svc_xprt *xprt = + list_entry(le, struct svc_xprt, xpt_list); + + sprintf(tmpstr, %s %d\n, xprt-xpt_name, xprt-xpt_max_payload); + slen = strlen(tmpstr); + if (len + slen maxlen) + break; + len += slen; + strcat(buf, tmpstr); + } + spin_unlock(svc_transport_lock); + + return len; +} + /* apparently the standard is that clients close * idle connections after 5 minutes, servers after * 6 minutes diff --git a/net/sunrpc/sysctl.c b/net/sunrpc/sysctl.c index 738db32..683cf90 100644 --- a/net/sunrpc/sysctl.c +++ b/net/sunrpc/sysctl.c @@ -27,6 +27,9 @@ unsigned int nfs_debug; unsigned int nfsd_debug; unsigned int nlm_debug; +/* Transport string */ +char xprt_buf[128]; + #ifdef RPC_DEBUG static struct ctl_table_header *sunrpc_table_header; @@ -48,6 +51,34 @@ rpc_unregister_sysctl(void) } } +int svc_print_transports(char *buf, int maxlen); +static int 
proc_do_xprt(ctl_table *table, int write, struct file *file, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + char tmpbuf[128]; + int len; + if ((*ppos !write) || !*lenp) { + *lenp = 0; + return 0; + } + + if (write) + return -EINVAL; + else { + + len = svc_print_transports(tmpbuf, 128); + if (!access_ok(VERIFY_WRITE, buffer, len)) + return -EFAULT; + + if (__copy_to_user(buffer, tmpbuf, len)) + return -EFAULT; + } + + *lenp -= len; + *ppos += len; + return 0; +} + static int proc_dodebug(ctl_table *table, int write, struct file *file, void __user *buffer, size_t *lenp, loff_t *ppos) @@ -111,7 +142,6 @@ done: return 0; } - static ctl_table debug_table[] = { { .ctl_name = CTL_RPCDEBUG, @@ -145,6 +175,14 @@ static ctl_table debug_table[] = { .mode = 0644, .proc_handler = proc_dodebug }, + { + .ctl_name = CTL_TRANSPORTS, + .procname = transports, + .data = xprt_buf, + .maxlen = sizeof(xprt_buf), + .mode = 0444, + .proc_handler = proc_do_xprt, + }, { .ctl_name = 0 } };
[ofa-general] [RFC, PATCH 20/20] knfsd: create listener via portlist write
Update the write handler for the portlist file to allow creating new listening endpoints on a transport. The general form of the string is a transport name and a port number separated by a space. For example: tcp 2049 This is intended to support the creation of a listening endpoint for RDMA transports without adding #ifdef code to the nfssvc.c file. The built-in transports UDP/TCP were left in the nfssvc initialization code to avoid having to change rpc.nfsd, etc... Signed-off-by: Tom Tucker [EMAIL PROTECTED] --- fs/nfsd/nfsctl.c | 17 + 1 files changed, 17 insertions(+), 0 deletions(-) diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c index 71c686d..da2abda 100644 --- a/fs/nfsd/nfsctl.c +++ b/fs/nfsd/nfsctl.c @@ -555,6 +555,23 @@ static ssize_t write_ports(struct file * kfree(toclose); return len; } + /* This implements the ability to add a transport by writing +* its transport name to the portlist file +*/ + if (isalnum(buf[0])) { + int err; + char transport[16]; + int port; + if (sscanf(buf, %15s %4d, transport, port) == 2) { + err = nfsd_create_serv(); + if (!err) + err = svc_create_svcsock(nfsd_serv, +transport, port, +SVC_SOCK_ANONYMOUS); + return err 0 ? err : 0; + } + } + return -EINVAL; }
[ofa-general] [RFC,PATCH 01/10] rdma: ONCRPC RDMA Header File
These are the core data types that are used to process the ONCRPC protocol in the NFS-RDMA client and server. Signed-off-by: Tom Talpey [EMAIL PROTECTED] --- include/linux/sunrpc/rpc_rdma.h | 116 +++ 1 files changed, 116 insertions(+), 0 deletions(-) diff --git a/include/linux/sunrpc/rpc_rdma.h b/include/linux/sunrpc/rpc_rdma.h new file mode 100644 index 000..0013a0d --- /dev/null +++ b/include/linux/sunrpc/rpc_rdma.h @@ -0,0 +1,116 @@ +/* + * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the BSD-type + * license below: + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials provided + * with the distribution. + * + * Neither the name of the Network Appliance, Inc. nor the names of + * its contributors may be used to endorse or promote products + * derived from this software without specific prior written + * permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#ifndef _LINUX_SUNRPC_RPC_RDMA_H +#define _LINUX_SUNRPC_RPC_RDMA_H + +struct rpcrdma_segment { + uint32_t rs_handle; /* Registered memory handle */ + uint32_t rs_length; /* Length of the chunk in bytes */ + uint64_t rs_offset; /* Chunk virtual address or offset */ +}; + +/* + * read chunk(s), encoded as a linked list. + */ +struct rpcrdma_read_chunk { + uint32_t rc_discrim;/* 1 indicates presence */ + uint32_t rc_position; /* Position in XDR stream */ + struct rpcrdma_segment rc_target; +}; + +/* + * write chunk, and reply chunk. + */ +struct rpcrdma_write_chunk { + struct rpcrdma_segment wc_target; +}; + +/* + * write chunk(s), encoded as a counted array. 
+ */ +struct rpcrdma_write_array { + uint32_t wc_discrim;/* 1 indicates presence */ + uint32_t wc_nchunks;/* Array count */ + struct rpcrdma_write_chunk wc_array[0]; +}; + +struct rpcrdma_msg { + uint32_t rm_xid;/* Mirrors the RPC header xid */ + uint32_t rm_vers; /* Version of this protocol */ + uint32_t rm_credit; /* Buffers requested/granted */ + uint32_t rm_type; /* Type of message (enum rpcrdma_proc) */ + union { + + struct {/* no chunks */ + uint32_t rm_empty[3]; /* 3 empty chunk lists */ + } rm_nochunks; + + struct {/* no chunks and padded */ + uint32_t rm_align; /* Padding alignment */ + uint32_t rm_thresh; /* Padding threshold */ + uint32_t rm_pempty[3]; /* 3 empty chunk lists */ + } rm_padded; + + uint32_t rm_chunks[0]; /* read, write and reply chunks */ + + } rm_body; +}; + +#define RPCRDMA_HDRLEN_MIN 28 + +enum rpcrdma_errcode { + ERR_VERS = 1, + ERR_CHUNK = 2 +}; + +struct rpcrdma_err_vers { + uint32_t rdma_vers_low; /* Version range supported by peer */ + uint32_t rdma_vers_high; +}; + +enum rpcrdma_proc { + RDMA_MSG = 0, /* An RPC call or reply msg */ + RDMA_NOMSG = 1, /* An RPC call or reply msg - separate body */ + RDMA_MSGP = 2, /* An RPC call or reply msg with padding */ + RDMA_DONE = 3, /* Client signals reply completion */ + RDMA_ERROR = 4 /* An RPC RDMA encoding error */ +}; + +#endif /* _LINUX_SUNRPC_RPC_RDMA_H */
[ofa-general] [RFC,PATCH 03/10] rdma: SVCRMDA Header File
This file defines the data types used by the SVCRDMA transport module. The principal data structure is the transport-specific extension to the svcxprt structure. Signed-off-by: Tom Tucker [EMAIL PROTECTED] --- include/linux/sunrpc/svc_rdma.h | 261 +++ 1 files changed, 261 insertions(+), 0 deletions(-) diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h new file mode 100644 index 000..0bad94b --- /dev/null +++ b/include/linux/sunrpc/svc_rdma.h @@ -0,0 +1,261 @@ +/* + * Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the BSD-type + * license below: + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials provided + * with the distribution. + * + * Neither the name of the Network Appliance, Inc. nor the names of + * its contributors may be used to endorse or promote products + * derived from this software without specific prior written + * permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * Author: Tom Tucker [EMAIL PROTECTED] + */ + +#ifndef SVC_RDMA_H +#define SVC_RDMA_H +#include linux/sunrpc/xdr.h +#include linux/sunrpc/svcsock.h +#include linux/sunrpc/rpc_rdma.h +#include rdma/ib_verbs.h +#include rdma/rdma_cm.h +#define SVCRDMA_DEBUG + +/* RPC/RDMA parameters */ +extern unsigned int svcrdma_ord; +extern unsigned int svcrdma_max_requests; +extern unsigned int svcrdma_max_req_size; +extern unsigned int rdma_stat_recv; +extern unsigned int rdma_stat_read; +extern unsigned int rdma_stat_write; +extern unsigned int rdma_stat_sq_starve; +extern unsigned int rdma_stat_rq_starve; +extern unsigned int rdma_stat_rq_poll; +extern unsigned int rdma_stat_rq_prod; +extern unsigned int rdma_stat_sq_poll; +extern unsigned int rdma_stat_sq_prod; + +#define RPCRDMA_VERSION 1 + +/* + * Contexts are built when an RDMA request is created and are a + * record of the resources that can be recovered when the request + * completes. 
+ */ +struct svc_rdma_op_ctxt { + struct svc_rdma_op_ctxt *next; + struct xdr_buf arg; + struct list_head dto_q; + enum ib_wr_opcode wr_op; + enum ib_wc_status wc_status; + u32 byte_len; + struct svcxprt_rdma *xprt; + unsigned long flags; + enum dma_data_direction direction; + int count; + struct ib_sge sge[RPCSVC_MAXPAGES]; + struct page *pages[RPCSVC_MAXPAGES]; +}; + +#define RDMACTXT_F_READ_DONE 1 +#define RDMACTXT_F_LAST_CTXT 2 + +struct svc_rdma_deferred_req { + struct svc_deferred_req req; + struct page *arg_page; + int arg_len; +}; + +struct svcxprt_rdma { + struct svc_sock sc_xprt; /* SVC transport structure */ + struct rdma_cm_id*sc_cm_id;/* RDMA connection id */ + struct list_head sc_accept_q; /* Conn. waiting accept */ + int sc_ord; /* RDMA read limit */ + wait_queue_head_tsc_read_wait; + int sc_max_sge; + + int sc_sq_depth; /* Depth of SQ */ + atomic_t sc_sq_count; /* Number of SQ WR on queue */ + + int sc_max_requests; /* Depth of RQ */ + int sc_max_req_size; /* Size of each RQ WR buf */ + + struct ib_pd *sc_pd; + + struct svc_rdma_op_ctxt *sc_ctxt_head; + int sc_ctxt_cnt; + int sc_ctxt_bump; + int sc_ctxt_max
[ofa-general] [RFC,PATCH 04/10] rdma: SVCRDMA Transport Module
This file implements the RDMA transport module initialization and termination logic and registers the transport sysctl variables. Signed-off-by: Tom Tucker [EMAIL PROTECTED] --- net/sunrpc/svc_rdma.c | 270 + 1 files changed, 270 insertions(+), 0 deletions(-) diff --git a/net/sunrpc/svc_rdma.c b/net/sunrpc/svc_rdma.c new file mode 100644 index 000..620249d --- /dev/null +++ b/net/sunrpc/svc_rdma.c @@ -0,0 +1,270 @@ +/* + * Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the BSD-type + * license below: + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials provided + * with the distribution. + * + * Neither the name of the Network Appliance, Inc. nor the names of + * its contributors may be used to endorse or promote products + * derived from this software without specific prior written + * permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * Author: Tom Tucker [EMAIL PROTECTED] + */ +#include linux/module.h +#include linux/init.h +#include linux/fs.h +#include linux/sysctl.h +#include linux/sunrpc/clnt.h +#include linux/sunrpc/sched.h +#include linux/sunrpc/svc_rdma.h + +#define RPCDBG_FACILITYRPCDBG_SVCXPRT + +/* RPC/RDMA parameters */ +unsigned int svcrdma_ord = RPCRDMA_ORD; +static unsigned int min_ord = 1; +static unsigned int max_ord = 4096; +unsigned int svcrdma_max_requests = RPCRDMA_MAX_REQUESTS; +static unsigned int min_max_requests = 4; +static unsigned int max_max_requests = 16384; +unsigned int svcrdma_max_req_size = RPCRDMA_MAX_REQ_SIZE; +static unsigned int min_max_inline = 4096; +static unsigned int max_max_inline = 65536; +static unsigned int zero = 0; +static unsigned int one = 1; + +unsigned int rdma_stat_recv = 0; +unsigned int rdma_stat_read = 0; +unsigned int rdma_stat_write = 0; +unsigned int rdma_stat_sq_starve = 0; +unsigned int rdma_stat_rq_starve = 0; +unsigned int rdma_stat_rq_poll = 0; +unsigned int rdma_stat_rq_prod = 0; +unsigned int rdma_stat_sq_poll = 0; +unsigned int rdma_stat_sq_prod = 0; + +extern struct svc_xprt svc_rdma_xprt; + +static struct ctl_table_header *svcrdma_table_header; +static ctl_table svcrdma_parm_table[] = { + { + .ctl_name = CTL_RDMA_MAX_REQUESTS, + .procname = max_requests, + .data = svcrdma_max_requests, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .strategy = 
sysctl_intvec, + .extra1 = min_max_requests, + .extra2 = max_max_requests + }, + { + .ctl_name = CTL_RDMA_MAX_REQ_SIZE, + .procname = max_req_size, + .data = svcrdma_max_req_size, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .strategy = sysctl_intvec, + .extra1 = min_max_inline, + .extra2 = max_max_inline + }, + { + .ctl_name = CTL_RDMA_ORD, + .procname = max_outbound_read_requests, + .data = svcrdma_ord, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax
[ofa-general] [RFC, PATCH 05/10] rdma: SVCRDMA Core Transport Services
This file implements the core transport data management and I/O path. The I/O path for RDMA involves receiving callbacks in interrupt context. Since all of the svc transport locks are _bh locks, we enqueue the transport on a list and schedule a tasklet to dequeue data indications from the RDMA completion queue. The tasklet in turn takes _bh locks to enqueue receive data indications on a list for the transport. The svc_rdma_recvfrom transport function dequeues data from this list in an NFSD thread context. Signed-off-by: Tom Tucker [EMAIL PROTECTED] --- net/sunrpc/svc_rdma_transport.c | 1207 +++ net/sunrpc/svcauth_unix.c |3 2 files changed, 1208 insertions(+), 2 deletions(-) diff --git a/net/sunrpc/svc_rdma_transport.c b/net/sunrpc/svc_rdma_transport.c new file mode 100644 index 000..3f1f251 --- /dev/null +++ b/net/sunrpc/svc_rdma_transport.c @@ -0,0 +1,1207 @@ +/* + * Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the BSD-type + * license below: + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials provided + * with the distribution. + * + * Neither the name of the Network Appliance, Inc. nor the names of + * its contributors may be used to endorse or promote products + * derived from this software without specific prior written + * permission.
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * Author: Tom Tucker [EMAIL PROTECTED] + */ + +#include asm/semaphore.h +#include linux/device.h +#include linux/in.h +#include linux/err.h +#include linux/time.h +#include linux/delay.h + +#include linux/sunrpc/svcsock.h +#include linux/sunrpc/debug.h +#include linux/sunrpc/rpc_rdma.h +#include linux/mm.h /* num_physpages */ +#include linux/spinlock.h +#include linux/net.h +#include net/sock.h +#include asm/io.h +#include rdma/ib_verbs.h +#include rdma/rdma_cm.h +#include net/ipv6.h +#include linux/sunrpc/svc_rdma.h + +#define RPCDBG_FACILITYRPCDBG_SVCXPRT + +int svc_rdma_create_svc(struct svc_serv *serv, struct sockaddr *sa, int flags); +static int svc_rdma_accept(struct svc_sock *xprt); +static void rdma_destroy_xprt(struct svcxprt_rdma *xprt); +static void dto_tasklet_func(unsigned long data); +static struct cache_deferred_req *svc_rdma_defer(struct cache_req *req); +static void svc_rdma_detach(struct svc_sock *svsk); +static void svc_rdma_free(struct svc_sock *svsk); +static int svc_rdma_has_wspace(struct svc_sock *svsk); +static int svc_rdma_get_name(char *buf, struct svc_sock *svsk); + +static void rq_cq_reap(struct svcxprt_rdma *xprt); +static void sq_cq_reap(struct 
svcxprt_rdma *xprt); + +DECLARE_TASKLET(dto_tasklet, dto_tasklet_func, 0UL); +static spinlock_t dto_lock = SPIN_LOCK_UNLOCKED; +static LIST_HEAD(dto_xprt_q); + +struct svc_xprt svc_rdma_xprt = { + .xpt_name = rdma, + .xpt_owner = THIS_MODULE, + .xpt_create_svc = svc_rdma_create_svc, + .xpt_get_name = svc_rdma_get_name, + .xpt_recvfrom = svc_rdma_recvfrom, + .xpt_sendto = svc_rdma_sendto, + .xpt_detach = svc_rdma_detach, + .xpt_free = svc_rdma_free, + .xpt_has_wspace = svc_rdma_has_wspace, + .xpt_max_payload = RPCSVC_MAXPAYLOAD_TCP, + .xpt_accept = svc_rdma_accept, + .xpt_defer = svc_rdma_defer +}; + +static int rdma_bump_context_cache(struct svcxprt_rdma *xprt) +{ + int target; + int at_least_one = 0; + struct svc_rdma_op_ctxt *ctxt; + + target = min(xprt-sc_ctxt_cnt + xprt-sc_ctxt_bump
[ofa-general] [RFC,PATCH 06/10] rdma: SVCRDMA recvfrom
This file implements the RDMA transport recvfrom function. The function dequeues work request completion contexts from an I/O list that it shares with the I/O tasklet in svc_rdma_transport.c. For ONCRPC RDMA, an RPC may not be complete when it is received. Instead, the RDMA header that precedes the RPC message informs the transport where to get the RPC data from on the client and where to place it in the RPC message before it is delivered to the server. The svc_rdma_recvfrom function therefore parses this RDMA header and issues any necessary RDMA operations to fetch the remainder of the RPC from the client. Special handling is required when the request involves an RDMA_READ; in this case, recvfrom submits all of the RDMA_READ requests to the underlying transport driver and then returns 0 (EAGAIN). When the transport completes the last RDMA_READ for the request, it enqueues the request on a read completion queue and enqueues the transport. The recvfrom code favors this queue over the regular DTO queue when satisfying reads. Signed-off-by: Tom Tucker [EMAIL PROTECTED] --- net/sunrpc/svc_rdma_recvfrom.c | 664 1 files changed, 664 insertions(+), 0 deletions(-) diff --git a/net/sunrpc/svc_rdma_recvfrom.c b/net/sunrpc/svc_rdma_recvfrom.c new file mode 100644 index 000..681f25a --- /dev/null +++ b/net/sunrpc/svc_rdma_recvfrom.c @@ -0,0 +1,664 @@ +/* + * Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the BSD-type + * license below: + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials provided + * with the distribution. + * + * Neither the name of the Network Appliance, Inc. nor the names of + * its contributors may be used to endorse or promote products + * derived from this software without specific prior written + * permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ * + * Author: Tom Tucker [EMAIL PROTECTED] + */ + +#include asm/semaphore.h +#include linux/device.h +#include linux/in.h +#include linux/err.h +#include linux/time.h + +#include linux/sunrpc/svcsock.h +#include linux/sunrpc/debug.h +#include linux/sunrpc/rpc_rdma.h +#include linux/mm.h /* num_physpages */ +#include linux/spinlock.h +#include linux/net.h +#include net/sock.h +#include asm/io.h +#include asm/unaligned.h +#include rdma/ib_verbs.h +#include rdma/rdma_cm.h +#include linux/sunrpc/svc_rdma.h + +#define RPCDBG_FACILITYRPCDBG_SVCXPRT + +/* + * Replace the pages in the rq_argpages array with the pages from the SGE in + * the RDMA_RECV completion. The SGL should contain full pages up until the + * last one. + */ +static void rdma_build_arg_xdr(struct svc_rqst *rqstp, + struct svc_rdma_op_ctxt *ctxt, + u32 byte_count) +{ + struct page *page; + u32 bc; + int sge_no; + + /* Swap the page in the SGE with the page in argpages */ + page = ctxt-pages[0]; + put_page(rqstp-rq_pages[0]); + rqstp-rq_pages[0] = page; + + /* Set up the XDR head */ + rqstp-rq_arg.head[0].iov_base = page_address(page); + rqstp-rq_arg.head[0].iov_len = min(byte_count, ctxt-sge[0].length); + rqstp-rq_arg.len = byte_count; + rqstp-rq_arg.buflen = byte_count; + + /* Compute bytes past head in the SGL */ + bc = byte_count - rqstp-rq_arg.head[0].iov_len; + + /* If data remains, store it in the pagelist */ + rqstp-rq_arg.page_len = bc; + rqstp-rq_arg.page_base = 0
[ofa-general] [RFC,PATCH 07/10] rdma: SVCRDMA sendto
This file implements the RDMA transport sendto function. An RPC reply on an RDMA transport consists of some number of RDMA_WRITE requests followed by an RDMA_SEND request. The sendto function parses the ONCRPC RDMA reply header to determine how to send the reply back to the client. The send queue is sized so as to be able to send complete replies for requests in most cases. In the event that there are not enough SQ WR slots to reply, e.g. for big data, the send will block the NFSD thread. The I/O callback functions in svc_rdma_transport.c that reap WR completions wake any waiters blocked on the SQ. In general, the goal is not to block NFSD threads; the has_wspace method stalls requests when the SQ is nearly full.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---
 net/sunrpc/svc_rdma_sendto.c | 515 ++
 1 files changed, 515 insertions(+), 0 deletions(-)

diff --git a/net/sunrpc/svc_rdma_sendto.c b/net/sunrpc/svc_rdma_sendto.c
new file mode 100644
index 000..cd4b5ac
--- /dev/null
+++ b/net/sunrpc/svc_rdma_sendto.c
@@ -0,0 +1,515 @@
+/*
+ * Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the BSD-type
+ * license below:
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials provided
+ * with the distribution.
+ *
+ * Neither the name of the Network Appliance, Inc.
nor the names of
+ * its contributors may be used to endorse or promote products
+ * derived from this software without specific prior written
+ * permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * Author: Tom Tucker [EMAIL PROTECTED]
+ */
+
+#include <asm/semaphore.h>
+#include <linux/device.h>
+#include <linux/in.h>
+#include <linux/err.h>
+#include <linux/time.h>
+
+#include <linux/sunrpc/svcsock.h>
+#include <linux/sunrpc/debug.h>
+#include <linux/sunrpc/rpc_rdma.h>
+#include <linux/mm.h>		/* num_physpages */
+#include <linux/spinlock.h>
+#include <linux/net.h>
+#include <net/sock.h>
+#include <asm/io.h>
+#include <asm/unaligned.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/rdma_cm.h>
+#include <linux/sunrpc/svc_rdma.h>
+
+#define RPCDBG_FACILITY	RPCDBG_SVCXPRT
+
+/* Encode an XDR as an array of IB SGE
+ *
+ * Assumptions:
+ * - head[0] is physically contiguous.
+ * - tail[0] is physically contiguous.
+ * - pages[] is not physically or virtually contiguous and consists of
+ *   PAGE_SIZE elements.
+ *
+ * Output:
+ * SGE[0]              reserved for RPCRDMA header
+ * SGE[1]              data from xdr->head[]
+ * SGE[2..sge_count-2] data from xdr->pages[]
+ * SGE[sge_count-1]    data from xdr->tail.
+ *
+ */
+static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
+				 struct xdr_buf *xdr,
+				 struct ib_sge *sge,
+				 int *sge_count)
+{
+	/* Max we need is the length of the XDR / pagesize + one for
+	 * head + one for tail + one for RPCRDMA header
+	 */
+	int sge_max = (xdr->len + PAGE_SIZE - 1) / PAGE_SIZE + 3;
+	int sge_no;
+	u32 byte_count = xdr->len;
+	u32 sge_bytes;
+	u32 page_bytes;
+	int page_off;
+	int page_no;
+
+	/* Skip the first sge, this is for the RPCRDMA header */
+	sge_no = 1;
+
+	/* Head SGE */
+	sge[sge_no].addr = ib_dma_map_single(xprt->sc_cm_id->device,
+					     xdr->head[0].iov_base,
+					     xdr->head[0].iov_len
[ofa-general] [RFC, PATCH 08/10] rdma: ONCRPC RDMA protocol marshalling
This logic parses the ONCRPC RDMA protocol headers that precede the actual RPC header. It is placed in a separate file to keep all protocol aware code in a single place.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---
 net/sunrpc/svc_rdma_marshal.c | 424 +
 1 files changed, 424 insertions(+), 0 deletions(-)

diff --git a/net/sunrpc/svc_rdma_marshal.c b/net/sunrpc/svc_rdma_marshal.c
new file mode 100644
index 000..feebabd
--- /dev/null
+++ b/net/sunrpc/svc_rdma_marshal.c
@@ -0,0 +1,424 @@
+/*
+ * Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the BSD-type
+ * license below:
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials provided
+ * with the distribution.
+ *
+ * Neither the name of the Network Appliance, Inc. nor the names of
+ * its contributors may be used to endorse or promote products
+ * derived from this software without specific prior written
+ * permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * Author: Tom Tucker [EMAIL PROTECTED]
+ */
+
+#include <asm/semaphore.h>
+#include <linux/device.h>
+#include <linux/in.h>
+#include <linux/err.h>
+#include <linux/time.h>
+
+#include <rdma/rdma_cm.h>
+
+#include <linux/sunrpc/svcsock.h>
+#include <linux/sunrpc/debug.h>
+#include <linux/sunrpc/rpc_rdma.h>
+#include <linux/spinlock.h>
+#include <linux/net.h>
+#include <net/sock.h>
+#include <asm/io.h>
+#include <asm/unaligned.h>
+#include <rdma/rdma_cm.h>
+#include <rdma/ib_verbs.h>
+#include <linux/sunrpc/rpc_rdma.h>
+#include <linux/sunrpc/svc_rdma.h>
+
+#define RPCDBG_FACILITY	RPCDBG_SVCXPRT
+
+/*
+ * Decodes a read chunk list. The expected format is as follows:
+ *    discrim  : xdr_one
+ *    position : u32 offset into XDR stream
+ *    handle   : u32 RKEY
+ *    . . .
+ *    end-of-list: xdr_zero
+ */
+static u32 *decode_read_list(u32 *va, u32 *vaend)
+{
+	struct rpcrdma_read_chunk *ch = (struct rpcrdma_read_chunk *)va;
+
+	while (ch->rc_discrim != xdr_zero) {
+		u64 ch_offset;
+
+		if (((unsigned long)ch + sizeof(struct rpcrdma_read_chunk)) >
+		    (unsigned long)vaend) {
+			dprintk("svcrdma: vaend=%p, ch=%p\n", vaend, ch);
+			return NULL;
+		}
+
+		ch->rc_discrim = ntohl(ch->rc_discrim);
+		ch->rc_position = ntohl(ch->rc_position);
+		ch->rc_target.rs_handle = ntohl(ch->rc_target.rs_handle);
+		ch->rc_target.rs_length = ntohl(ch->rc_target.rs_length);
+		va = (u32 *)&ch->rc_target.rs_offset;
+		xdr_decode_hyper(va, &ch_offset);
+		put_unaligned(ch_offset, (u64 *)va);
+		ch++;
+	}
+	return (u32 *)&ch->rc_position;
+}
+
+/*
+ * Determine number of chunks and total bytes in chunk list. The chunk
+ * list has already been verified to fit within the RPCRDMA header.
+ */
+void svc_rdma_rcl_chunk_counts(struct rpcrdma_read_chunk *ch,
+			       int *ch_count, int *byte_count)
+{
+	/* compute the number of bytes represented by read chunks */
+	*byte_count = 0;
+	*ch_count = 0;
+	for (; ch->rc_discrim != 0; ch++) {
+		*byte_count = *byte_count + ch->rc_target.rs_length;
+		*ch_count = *ch_count + 1;
+	}
+}
+
+/*
+ * Decodes a write chunk list. The expected format is as follows:
+ *    discrim : xdr_one
+ *    nchunks : count
+ *       handle : u32 RKEY ---+
+ *       length : u32 len
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
On Wed, 2007-08-15 at 22:26 -0400, Jeff Garzik wrote: [...snip...] I think removing the RDMA stack is the wrong thing to do, and you shouldn't just threaten to yank entire subsystems because you don't like the technology. Let's keep this constructive, can we? RDMA should get the respect of any other technology in Linux. Maybe it's a niche in your opinion, but come on, there are more RDMA users than, say, the sparc64 port. Eh? It's not about being a niche. It's about creating a maintainable software net stack that has predictable behavior. Isn't RDMA _part_ of the software net stack within Linux? Why isn't making RDMA stable, supportable and maintainable equally as important as any other subsystem? Needing to reach out of the RDMA sandbox and reserve net stack resources away from itself travels a path we've consistently avoided. I will NACK any patch that opens up sockets to eat up ports or anything stupid like that. Got it. Ditto for me as well. Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [ANNOUNCE] NFS-RDMA for OFED 1.2 G/A
For those interested in NFS-RDMA, OGC has created an install package based on the OFA 1.2 GA release. The package supports both SLES 10 and RHEL 5. You can download this package from http://www.opengridcomputing.com/nfs-rdma.html. Please let me know if you find any problems. Thanks, Tom ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH] amso1100: QP init bug in amso driver
Roland: The guys at UNH found this and fixed it. I'm surprised no one has hit this before. I guess it only breaks when the refcount on the QP is non-zero. Initialize the wait_queue_head_t in the c2_qp structure.

Signed-off-by: Ethan Burns [EMAIL PROTECTED]
Acked-by: Tom Tucker [EMAIL PROTECTED]
---
 drivers/infiniband/hw/amso1100/c2_qp.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2_qp.c b/drivers/infiniband/hw/amso1100/c2_qp.c
index 420c138..01d0786 100644
--- a/drivers/infiniband/hw/amso1100/c2_qp.c
+++ b/drivers/infiniband/hw/amso1100/c2_qp.c
@@ -506,6 +506,7 @@ int c2_alloc_qp(struct c2_dev *c2dev,
 	qp->send_sgl_depth = qp_attrs->cap.max_send_sge;
 	qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge;
 	qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge;
+	init_waitqueue_head(&qp->wait);
 	/* Initialize the SQ MQ */
 	q_size = be32_to_cpu(reply->sq_depth);

___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] OFED July 2, meeting summary on next OFED plans
Roland: On Tue, 2007-07-03 at 15:35 -0500, Steve Wise wrote: Tom, can you update us on NFS-RDMA? Roland Dreier wrote: NFSoRDMA integration. I would like to see a status report on NFS/RDMA from the people who want it in OFED. As I understand it there are many core kernel changes required for this -- switchable transports and also mount option changes? You are correct about the scope of the changes, although many of them are already in the kernel. Chuck Lever just posted the mount changes and I have posted a second round of the NFS-RDMA patches. You can see these on [EMAIL PROTECTED] I would like to get them upstream in 2.6.23, but that's probably optimistic. As far as I can tell from the outside, the NFS/RDMA effort seems to have stalled -- whenever I talk to core NFS developers like Chuck Lever or Trond Myklebust, they say that they are just waiting for the NFS/RDMA developers to submit their changes for review. And I haven't seen any patches for a kernel newer than 2.6.18, so things look quite out-of-date. I'm not sure when you talked to those guys, but as I mentioned, this is round two of the patch submission. There is also a git tree that has these submitted patches available for download and testing. These are on a 2.6.22-rc6 base and the git URL is git://linux-nfs.org/~tomtucker/nfs-rdma-dev-2.6.git If you like, I can post the patchset here as well. Without visible progress towards getting NFS/RDMA into mergeable form soon, I think putting it into OFED 1.3 as anything other than a technology preview that may be dropped from future releases would be a very risky thing to do. Otherwise OFED risks getting stuck maintaining the whole NFS/RDMA stack, since the development effort outside of OFED really looks to me like it is fizzling out. Perhaps the activity is not where you're used to looking. Both Trond and Neal reviewed the previous patchset and provided feedback that I addressed in the most recent patchset.
That said, I'm sure there will be quite a bit more before it's mergeable. - R. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: Incorrect max_sge reported in mthca device query
On Mon, 2007-04-02 at 09:08 +0300, Michael S. Tsirkin wrote: On Sun, 2007-04-01 at 09:43 +0300, Michael S. Tsirkin wrote: [...snip...] I think that if we extend the API, we need to design it carefully to cover as many use cases as possible. Tom, could you explain what you are trying to do? Why does your application need as many SGEs as possible? Mike: The application is NFS-RDMA. NFS keeps its data as non-contiguous arrays of pages. So the motivation is that having a larger SGL allows you to support larger data transfers with a single operation. The challenge with the current query/request method is that, as we've discussed, the advertised max may not work. What makes the adjust/retry unworkable is that you don't know which of the advertised maxes caused the request to fail. So when you retry, which qp_attr do you adjust? The send sge? The recv sge? The qp depth? So what I'm proposing, and I think is similar if not identical to what other folks have talked about, is having an interface that treats the qp_attr values as requested sizes that can be adjusted by the provider. So for example, if I ask for a send_sge of 30, but you can only do 28, you give me 28 and adjust the qp_attr structure so that I know what I got. This would allow me to perform a predictable sequence of 1. query, 2. request, 3. adjust in my code. BTW, I think it needs to be a new provider method to be done efficiently. Also, what's a good name, ib_request_qp? Thanks, Tom Also - what about out of resources cases described above? Would you expect the verbs API to retry the request for you? ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] Re: Incorrect max_sge reported in mthca device query
On Thu, 2007-04-05 at 09:27 -0700, Sean Hefty wrote: The challenge with the current query/request method is that as we've discussed the advertised max may not work. What makes the adjust/retry unworkable is that you don't know which of the advertised maxes caused the request to fail. So when you retry, which qp_attr do you adjust? The send sge? The recv sge? The qp depth? So what I'm proposing, and I think is similar if not identical to what other folks have talked about is having an interface that treats the qp_attr values as requested-sizes that can be adjusted by the provider. So for example, if I ask for a send_sge of 30, but you can only do 28, you give me 28 and adjust the qp_attr structure so that I know what I got. This would allow me to perform a predictable sequence of 1. query, 2. request, 3. adjust in my code. If the send sge/recv sge/qp depth/etc. aren't independent though, this pushes the problem and policy decision down to the provider. I can't think of an easy solution to this. Agreed. But practically I think they are. I think the SGE max is driven off the max size of a WR and type of QP. This is true of the iWARP adapters as well. But taking the bait...even if you didn't push it down to the provider, how do you expose the inter-relationships to the consumer? An approach in this vein is a could_you_would_you/why_not interface that would return whether or not the specified qp_attr would work and if it didn't some indication of which resource(s) caused the problem. The problems there are a) the resource may be gone when you go back with what you just had approved, and b) you still have to fuss with multiple whacks at it if you couldn't get what you asked for. I think something simpler, although arguably not perfect is the way to go. Tom - Sean ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: Incorrect max_sge reported in mthca device query
Michael: Thanks for the detailed reply. How about if we added an interface that would treat the SGE counts/WR counts as requests and then update the qp_init_attr struct with what was actually created? That would allow the app to request the max, but settle for what the device was capable of at the time. On Sun, 2007-04-01 at 09:43 +0300, Michael S. Tsirkin wrote: Quoting Tom Tucker [EMAIL PROTECTED]: Subject: Incorrect max_sge reported in mthca device query Roland: I think the max_sge reported by mthca_query_device is off by one. If you try to create a QP with the reported max, it fails with -EINVAL. I think the reason is that the mthca_alloc_wqe_buf function reserves a slot for a bind request and this pushes the WQE size over the 496B limit when the user requests the max (30) when allocating the QP. Please let me know if I'm confused about what max_sge really means. Thanks, Tom Tom, max_sge reported by mthca_query_device is the upper bound for all QP types. I have not tested this, but think you can create a UD type QP with this number of SGEs. I'd like to add that there can be no hard guarantee that creating a QP with a specific set of max_sge/max_wr always succeeds even if it is within the range of values reported by mthca_query_device: for example, for userspace QPs, the system administrator might have limited the amount of memory that can be locked up by these QPs, and QP allocation requests with large max_sge/max_wr values will always fail. There are other examples of this. Thus, an application that wants to use as large a number of SGEs/WRs as possible in a robust fashion currently has no other choice except a trial and error approach, handling failures gracefully. Finally, as a side note, it is *also* inefficient to request allocation of more sge entries than the ULP will typically use - for reasons such as cache utilization, and many others.
How this overhead trades off against the ULP's need to sometimes post multiple WRs will depend on both the ULP and the hardware used. This need to tune the ULP to a specific HCA is annoying, and might be something that we want to try and solve at the API level. However, max_sge/max_wr values in query device are unlikely to be the appropriate API for this. One way out could be to extend the API for create_qp and friends, passing in both min and max values for some parameters, and allowing the verbs provider to choose the optimal combination of these. I think I floated a similar proposal once already, but there didn't appear to be sufficient user support for such a large API extension. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general