Re: [ofa-general] Cannot export multiple directories using nfs-rdma
Jeff Johnson wrote:
> I have an nfs-rdma configuration using Mellanox ConnectX-DDR and OFED 1.4.2 on CentOS 5.3 x86_64. My ConnectX cards are running 2.5.0 firmware, as I have read that 2.6.0 had RDMA issues; I saw those issues and down-rev'd the cards to 2.5.0. I am seeing a peculiar behavior where, if I export two separate directories from the server and attempt to mount them separately from a client, I end up with the same export mounted at two different client directories.
>
> e.g., the server exports /raid1 and /raid2:
>
>   'mount.rnfs 10.0.0.251:/raid1 /raid1 -i -o rdma,port=2050'
>   client:/raid1 --- has server:/raid1 contents
>
>   'mount.rnfs 10.0.0.251:/raid2 /raid2 -i -o rdma,port=2050'
>   client:/raid2 --- has server:/raid1 contents
>
> I have tried creating multiple RDMA ports on the server (2050 and 2051) and then using a different port for each separate mount. The result is the same. I have verified that I am indeed mounting RDMA and not merely IPoIB.
>
> Is nfs-rdma capable of multiple exports? If so, I cannot find a method for dealing with multiple exports from the server or client side in any OFED docs. Thanks for any assistance.
>
> --
> Jeff Johnson
> Manager
> Aeon Computing
> jeff.john...@aeoncomputing.com
> t: 858-412-3810  f: 858-412-3845  m: 619-204-9061
> 4905 Morena Boulevard, Suite 1313 - San Diego, CA 92117

Hi Jeff:

The mount service does not run over RDMA; it only runs over TCP/UDP. I believe you should be able to reproduce this behavior on plain old GigE/IPoIB. Is this the case?

Tom

___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
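For reference, a two-export server setup is normally described in /etc/exports. The fragment below is a hypothetical sketch (the paths, client network, and options are assumptions, not taken from this thread); giving each export a distinct fsid is one common way to make sure the server hands out distinct file handles per directory, which is worth ruling out when two mounts show identical contents:

```
# /etc/exports -- hypothetical sketch, not taken from this thread
/raid1  10.0.0.0/24(rw,no_root_squash,insecure,fsid=1)
/raid2  10.0.0.0/24(rw,no_root_squash,insecure,fsid=2)
```

After editing, `exportfs -ra` makes the server re-read the export table; the client-side mount.rnfs commands from the thread are unchanged.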
[ofa-general] Re: NFSRDMA connectathon prelim. testing status,
Vu: What memory registration model are you using?

Vu Pham wrote:
> Hi Tom,
>
> I have both the nfsrdma client and server on a 2.6.29-rc5 kernel with nfs-utils-1.1.4. I'm using both InfiniHost III (ib_mthca) and ConnectX (mlx4_ib) HCAs. I saw several problems during my testing at NFS Connectathon 2009:
>
> 1. When I used ConnectX (mlx4_ib) HCAs on both client and server, the client cannot mount. Talking to Tom Talpey and scanning the code, I saw that the xprtrdma module is using ib_reg_phys_mr(), and the mlx4_ib verbs provider does not implement this verb. If I have the client on mlx4_ib and the server on ib_mthca, I hit the following crash because of bad error handling in xprtrdma (see the attached file mlx4_mount_problem.log). Because of this problem, I used InfiniHost III (ib_mthca) for all of my tests at Connectathon.
>
> 2. Testing the Linux nfsrdma client against both the Linux and OpenSolaris nfsrdma servers, I hit a hung-process problem during the Connectathon lock test (see the attached files sync_page_1.log and sync_page_2.log). I can only reproduce it when I run Connectathon for more than 500 iterations (-N 1000). I can NOT reproduce the problem with the nfs client/server over IPoIB.
>
> 3. Testing the OpenSolaris nfsrdma client against the Linux nfsrdma server, I hit the following BUG_ON() right away (see the attached file svcrdma_send.log).
>
> thanks,
> -vu
Re: [Fwd: RE: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?]
Jeff:

Unfortunately, the NFSRDMA transport cannot make your disks go faster. If the storage subsystem is incapable of keeping up with IPoIB, then it won't be able to keep up with NFSRDMA either. To compare NFSRDMA and IPoIB performance absent a very fast storage subsystem, you'll need to keep the file sizes small enough that they fit within the server cache.

Tom

Jeff Becker wrote:
> Hi. Just passing this on in case you missed it. Do you have any advice on what knobs to tweak to get better performance (than NFS/IPoIB)? Thanks.
>
> -jeff
>
> -------- Original Message --------
> Subject: RE: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?
> Date: Mon, 10 Nov 2008 16:27:50 +
> From: Ciesielski, Frederic (EMEA HPCOSLO CC) [EMAIL PROTECTED]
> To: Jeff Becker [EMAIL PROTECTED]
> CC: general@lists.openfabrics.org
>
> That's great, thanks.
>
> I ran some tests with the 2.6.27 kernel as server and client, and basically it works fine. I could not yet find any situation where NFS-RDMA would outperform NFS/IPoIB, at least when you compare apples to apples (same clients, same server, same protocol, and not just writing to/reading from the caches), and it even seems to have severe performance issues when reading files larger than the memory size of the client and the server. Hopefully this will improve when more users are able to give valuable feedback...
>
> Fred.
>
> -----Original Message-----
> From: Jeff Becker [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, 08 November, 2008 22:35
> To: Ciesielski, Frederic (EMEA HPCOSLO CC)
> Cc: general@lists.openfabrics.org
> Subject: Re: [ofa-general] NFS-RDMA (OFED1.4) with standard distributions ?
>
> Ciesielski, Frederic (EMEA HPCOSLO CC) wrote:
>> Is there any chance that the new NFS-RDMA features coming with OFED 1.4 work with standard and current distributions, like RHEL5, SLES10?
>
> Not yet, but I'm working on it. I intend for NFSRDMA to work on 2.6.27 and 2.6.26 for OFED 1.4. The RHEL5 and SLES10 backports will likely be done for OFED 1.4.1. Thanks.
>
> -jeff
>
>> Did anybody test this, or would pretend it is supposed to work? I mean without building a 2.6.27 or equivalent kernel on top of it, keeping almost full support from the vendors. Enhanced kernel modules may not be sufficient to work around the limitations of old kernels...
[ofa-general] [PATCH 00/03] RDMA Transport Support for 9P
Roland:

This patchset implements an RDMA transport provider for v9fs (the Plan 9 filesystem). Could you take a look at it and let us know what you think?

Thanks,
Tom

Here is the original posting...

Eric:

This patch series implements an RDMA transport provider for 9P and is relative to your for-next branch. The RDMA support is built on the OpenFabrics API and uses SEND and RECV to exchange data. This patch series has been tested with dbench and iozone.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
Signed-off-by: Latchesar Ionkov [EMAIL PROTECTED]

[PATCH 01/03] 9prdma: RDMA Transport Support for 9P
 net/9p/trans_rdma.c |  996 +++
 1 files changed, 996 insertions(+), 0 deletions(-)

[PATCH 02/03] 9prdma: Makefile change for the RDMA transport
 net/9p/Makefile |    4
 1 files changed, 4 insertions(+), 0 deletions(-)

[PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport
 net/9p/Kconfig |    6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)
[ofa-general] [PATCH 02/03] 9prdma: Makefile change for the RDMA transport
This adds a make rule for the 9pnet_rdma module that implements the RDMA transport.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
Signed-off-by: Latchesar Ionkov [EMAIL PROTECTED]
---
 net/9p/Makefile |    4
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/9p/Makefile b/net/9p/Makefile
index 5192194..bc909ab 100644
--- a/net/9p/Makefile
+++ b/net/9p/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_NET_9P) := 9pnet.o
 obj-$(CONFIG_NET_9P_VIRTIO) += 9pnet_virtio.o
+obj-$(CONFIG_NET_9P_RDMA) += 9pnet_rdma.o

 9pnet-objs := \
 	mod.o \
@@ -12,3 +13,6 @@ obj-$(CONFIG_NET_9P_VIRTIO) += 9pnet_virtio.o
 9pnet_virtio-objs := \
 	trans_virtio.o \
+
+9pnet_rdma-objs := \
+	trans_rdma.o \
[ofa-general] [PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport
This patch adds a config option for the 9P RDMA transport.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
Signed-off-by: Latchesar Ionkov [EMAIL PROTECTED]
---
 net/9p/Kconfig |    6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/9p/Kconfig b/net/9p/Kconfig
index ff34c5a..c42c0c4 100644
--- a/net/9p/Kconfig
+++ b/net/9p/Kconfig
@@ -20,6 +20,12 @@ config NET_9P_VIRTIO
 	  This builds support for a transports between
 	  guest partitions and a host partition.

+config NET_9P_RDMA
+	depends on NET_9P && INFINIBAND && EXPERIMENTAL
+	tristate "9P RDMA Transport (Experimental)"
+	help
+	  This builds support for a RDMA transport.
+
 config NET_9P_DEBUG
 	bool "Debug information"
 	depends on NET_9P
[ofa-general] [PATCH 01/03] 9prdma: RDMA Transport Support for 9P
This file implements the RDMA transport provider for 9P. It allows mounts to be performed over iWARP- and IB-capable network interfaces and uses the OpenFabrics API to perform I/O.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
Signed-off-by: Latchesar Ionkov [EMAIL PROTECTED]
---
 net/9p/trans_rdma.c | 1025 +++
 1 files changed, 1025 insertions(+), 0 deletions(-)

diff --git a/net/9p/trans_rdma.c b/net/9p/trans_rdma.c
new file mode 100644
index 000..f919768
--- /dev/null
+++ b/net/9p/trans_rdma.c
@@ -0,0 +1,1025 @@
+/*
+ * linux/fs/9p/trans_rdma.c
+ *
+ * RDMA transport layer based on the trans_fd.c implementation.
+ *
+ * Copyright (C) 2008 by Tom Tucker [EMAIL PROTECTED]
+ * Copyright (C) 2006 by Russ Cox [EMAIL PROTECTED]
+ * Copyright (C) 2004-2005 by Latchesar Ionkov [EMAIL PROTECTED]
+ * Copyright (C) 2004-2008 by Eric Van Hensbergen [EMAIL PROTECTED]
+ * Copyright (C) 1997-2002 by Ron Minnich [EMAIL PROTECTED]
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to:
+ *   Free Software Foundation
+ *   51 Franklin Street, Fifth Floor
+ *   Boston, MA  02111-1301  USA
+ *
+ */
+
+#include <linux/in.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <linux/ipv6.h>
+#include <linux/kthread.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/un.h>
+#include <linux/uaccess.h>
+#include <linux/inet.h>
+#include <linux/idr.h>
+#include <linux/file.h>
+#include <linux/parser.h>
+#include <net/9p/9p.h>
+#include <net/9p/transport.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/rdma_cm.h>
+
+#define P9_PORT			5640
+#define P9_RDMA_SQ_DEPTH	32
+#define P9_RDMA_RQ_DEPTH	32
+#define P9_RDMA_SEND_SGE	4
+#define P9_RDMA_RECV_SGE	4
+#define P9_RDMA_IRD		0
+#define P9_RDMA_ORD		0
+#define P9_RDMA_TIMEOUT		30000		/* 30 seconds */
+#define P9_RDMA_MAXSIZE		(4*4096)	/* Min SGE is 4, so we can
+						 * safely advertise a maxsize
+						 * of 64k */
+
+#define P9_RDMA_MAX_SGE (P9_RDMA_MAXSIZE >> PAGE_SHIFT)
+
+/**
+ * struct p9_trans_rdma - RDMA transport instance
+ *
+ * @state: tracks the transport state machine for connection setup and tear down
+ * @cm_id: The RDMA CM ID
+ * @pd: Protection Domain pointer
+ * @qp: Queue Pair pointer
+ * @cq: Completion Queue pointer
+ * @lkey: The local access only memory region key
+ * @next_tag: The next tag for tracking rpc
+ * @timeout: Number of uSecs to wait for connection management events
+ * @sq_depth: The depth of the Send Queue
+ * @sq_count: Number of WR on the Send Queue
+ * @rq_depth: The depth of the Receive Queue. NB: I _think_ that 9P is
+ * purely req/rpl (i.e. no unaffiliated replies), but I'm not sure, so
+ * I'm allowing this to be tweaked separately.
+ * @addr: The remote peer's address
+ * @req_lock: Protects the active request list
+ * @req_list: List of sent RPC awaiting replies
+ * @send_wait: Wait list when the SQ fills up
+ * @cm_done: Completion event for connection management tracking
+ */
+struct p9_trans_rdma {
+	enum {
+		P9_RDMA_INIT,
+		P9_RDMA_ADDR_RESOLVED,
+		P9_RDMA_ROUTE_RESOLVED,
+		P9_RDMA_CONNECTED,
+		P9_RDMA_FLUSHING,
+		P9_RDMA_CLOSING,
+		P9_RDMA_CLOSED,
+	} state;
+	struct rdma_cm_id *cm_id;
+	struct ib_pd *pd;
+	struct ib_qp *qp;
+	struct ib_cq *cq;
+	struct ib_mr *dma_mr;
+	u32 lkey;
+	atomic_t next_tag;
+	long timeout;
+	int sq_depth;
+	atomic_t sq_count;
+	int rq_depth;
+	struct sockaddr_in addr;
+
+	spinlock_t req_lock;
+	struct list_head req_list;
+
+	wait_queue_head_t send_wait;
+	struct completion cm_done;
+	struct p9_idpool *tagpool;
+};
+
+/**
+ * p9_rdma_context - Keeps track of in-process WR
+ *
+ * @wc_op: Mellanox's broken HW doesn't provide the original WR op
+ * when the CQE completes in error. This forces apps to keep track of
+ * the op themselves. Yes, it's a Pet Peeve of mine ;-)
+ * @busa: Bus address to unmap when the WR completes
+ * @req: Keeps track of requests (send)
+ * @rcall: Keeps track of replies (receive)
+ */
+struct p9_rdma_req;
+struct p9_rdma_context {
+	enum ib_wc_opcode wc_op;
+	dma_addr_t busa;
+	union
Re: [ofa-general] [PATCH 00/03] RDMA Transport Support for 9P
Roland Dreier wrote:
>> This patchset implements an RDMA transport provider for the v9fs (Plan 9 filesystem). Could you take a look at it and let us know what you think?
>
> I sent comments on the initial posting I saw on lkml ... did they not make it to you?

No, I just missed it. Sorry. I just responded to your comments.

> [PATCH 01/03] 9prdma: RDMA Transport Support for 9P
> [PATCH 02/03] 9prdma: Makefile change for the RDMA transport
> [PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport
>
> One meta-comment I didn't send last time: the patches are small enough that I would just send it all in one patch, since it makes sense to apply it that way anyway.

Ok, makes my life easy.

> - R.
[ofa-general] Fast Reg Question
Roland:

I'm a little perplexed by the fast reg WR definition. The context is that I'm using the Fast Reg verb to map the local memory that is the data source for an RDMA_WRITE. The WR format, however, only takes an rkey. How does this all work when you're using fast reg to map local memory? Does the WR really need the mr pointer, or both the lkey and rkey? The IBTA spec seems to indicate that it needs more information about the MR than just the rkey.

Tom
Re: [ofa-general] Fast Reg Question
Roland Dreier wrote:
>> I'm a little perplexed by the fast reg WR definition. The context is that I'm using the Fast Reg verb to map the local memory that is the data source for an RDMA_WRITE. The WR format, however, only takes an rkey. How does this all work when you're using fast reg to map local memory? Does the WR really need the mr pointer, or both the lkey and rkey? The IBTA spec seems to indicate that it needs more information about the MR than just the rkey.
>
> On Mellanox, L_Key and R_Key are always the same,

Also true for iWARP.

> so it doesn't really matter. I think in general the idea would be that the L_Key you have gets updated with any consumer key changes you make in the WR but otherwise works the same.

Fair enough. Use the mr->lkey value in the SGE for subsequent DTO.

> The WR processing had better be able to find the MR by R_Key, so I think it's OK.

It just seems a little weird to be supplying the R_Key when you're mapping local memory. I'll look at the IB spec though. The spec refers to a bunch of verification on the L_Key. Obviously, if the L_Key and R_Key are the same, the distinction is moot.

> - R.
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
On Mon, 2008-05-26 at 08:07 -0500, Steve Wise wrote:
> Roland Dreier wrote:
>>> - device-specific alloc/free of physical buffer lists for use in fast register work requests. This allows devices to allocate this memory as needed (like via dma_alloc_coherent).
>>
>> I'm looking at how one would implement the MM extensions for mlx4, and it turns out that in addition to needing to allocate these fastreg page lists in coherent memory, mlx4 is even going to need to write to the memory (basically set the lsb of each address for internal device reasons). So I think we just need to update the documentation of the interface so that not only does the page list belong to the device driver between posting the fastreg work request and completing the request, but also the device driver is allowed to change the page list as part of the work request processing. I don't see any real reason why this would cause problems for consumers; does this seem OK to other people?
>
> Tom, does this affect how you plan to implement NFSRDMA MEM_MGT_EXTENSIONS support?

I think this is ok.
Re: [ofa-general] device attributes
On Wed, 2008-06-04 at 09:28 -0500, Steve Wise wrote:
> Roland/All,
>
> Should the device attributes (for instance max_send_wr) be the max supported by the HW, the max supported by the OS, or something else?

Something else.

> For instance: Chelsio's HW can handle very large work queues, but since Linux limits the size of contiguous dma coherent memory allocations, the actual limits are much smaller.

Basing the limit on an OS resource seems arbitrary and dangerous. Applications using advertised adapter resource limits will unnecessarily consume the maximum.

> Which should I be using for the device attributes?

Arbitrary knee jerk == 512. However, surveying current app usage as well as the other manufacturers' advertised limits will make it less arbitrary.

> Also, the chelsio device uses a single work queue to implement the SQ and RQ abstractions. So the max SQ depth depends on the RQ depth and vice versa. This leads to device max attributes that aren't that useful.

So the real limit is the HW WQ max, and therefore max SQ = HW WQ max - RQ max? Setting RQ and SQ to 512 solves this problem.

> I'm wondering what application writers should glean from these attributes...

Here's what I suggest:

- Set the RQ and SQ max to some reasonable default limit (e.g. 512).
- Add an escape hatch by providing module options to override the default max.

Tom

> Steve
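The suggestion above -- advertise a reasonable fixed default with an override knob, clamped against the shared hardware work queue -- can be sketched in plain C. All of the names and numbers here are hypothetical, not taken from any real driver; in a kernel module, depth_override would be a module parameter:

```c
/* Hypothetical numbers for the sketch -- not any real device's limits. */
#define HW_WQ_MAX     16384  /* one HW work queue shared by SQ and RQ */
#define DEFAULT_DEPTH 512    /* the "arbitrary knee jerk" default */

/* Stand-in for a module parameter (0 means "use the default"). */
static int depth_override = 0;

/* Depth to advertise in max_send_wr / max_recv_wr: the default (or the
 * override), clamped so that an SQ and an RQ both at the advertised
 * depth still fit in the shared hardware work queue. */
static int advertised_depth(void)
{
	int depth = depth_override ? depth_override : DEFAULT_DEPTH;

	if (2 * depth > HW_WQ_MAX)
		depth = HW_WQ_MAX / 2;
	return depth;
}
```

With these made-up numbers, the advertised depth is 512 by default, and an over-large override (say 65536) gets clamped to 8192 so an SQ and RQ at the advertised depth still fit in the shared WQ.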
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
On Tue, 2008-05-27 at 10:33 -0500, Tom Tucker wrote:
> On Mon, 2008-05-26 at 16:02 -0700, Roland Dreier wrote:
>>> The invalidate-local-stag part of a read is just a local, sink-side operation (i.e. no wire protocol change from a read). It's not like processing an ingress send-with-inv. It is really functionally like a read followed immediately by a fenced invalidate-local, but it doesn't stall the pipe. So the device has to remember that the read is with-inv-local-stag and invalidate the stag after the read response is placed and before the WCE is reaped by the application.
>>
>> Yes, understood. My point was just that in IB, at least in theory, one could just use an L_Key that doesn't have any remote permissions in the scatter list of an RDMA read, while in iWARP the STag used to place an RDMA read response has to have remote write permission. So RDMA read with invalidate makes sense for iWARP, because it gives a race-free way to allow an STag to be invalidated immediately after an RDMA read response is placed, while in IB it's simpler just to never give remote access at all.
>>
>> - R.
>
> So I think from an NFSRDMA coding perspective it's a wash...
>
> When creating the local data sink, we need to check the transport type: if it's IB -- local access only; if it's iWARP -- local + remote access.
>
> When posting the WR, we check the fastreg capabilities bit + transport type bit:
>
> 	if fastreg is true
> 		post FastReg
> 		if iWARP (or with a cap-bit read-with-inv flag)
> 			post rdma read w/ invalidate
> 		else /* IB */
> 			post rdma read
> 			post invalidate
> 		fi
> 	else
> 		... today's logic
> 	fi
>
> I make the observation, however, that the transport type is now overloaded with a set of required verbs. For iWARP's case, this means rdma-read-w-inv, plus rdma-send-w-inv, etc... This also means that new transport types will inherit one or the other set of verbs (IB or iWARP).
>
> Tom

Steve pointed out a good optimization here. Instead of fencing the RDMA READ in advance of the INVALIDATE, we should post the INVALIDATE when the READ WR completes. This will avoid stalling the SQ. Since IB doesn't put the LKEY on the wire, there's no security issue to close. We need to keep a bunch of fastreg MRs around anyway for concurrent RPC. Thoughts?

Tom
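The branch logic quoted above can be captured as a small, transport-agnostic decision table. This is only a model of the decision for illustration -- the function and strings are hypothetical, not the xprtrdma code:

```c
/* Model of the WR-posting decision described above (hypothetical names). */
enum xprt_type { XPRT_IB, XPRT_IWARP };

static const char *read_post_plan(enum xprt_type t, int has_fastreg,
				  int has_read_with_inv)
{
	if (!has_fastreg)
		return "today's logic";	/* legacy registration path */

	/* Every fastreg-capable case posts the FastReg first, then: */
	if (t == XPRT_IWARP || has_read_with_inv)
		return "fastreg + rdma-read-with-invalidate";

	/* IB: no remote access was granted, so invalidate lazily when the
	 * READ completes instead of posting a fenced invalidate. */
	return "fastreg + rdma-read, invalidate on READ completion";
}
```

The third branch reflects the optimization discussed in the thread: on IB, the INVALIDATE is posted when the READ completes rather than fenced behind it, so the SQ never stalls.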
Re: [ofa-general] [PATCH RFC v3 1/2] RDMA/Core: MEM_MGT_EXTENSIONS support
On Tue, 2008-05-27 at 12:39 -0400, Talpey, Thomas wrote:
> At 11:33 AM 5/27/2008, Tom Tucker wrote:
>> So I think from an NFSRDMA coding perspective it's a wash...
>
> Just to be clear, you're talking about the NFS/RDMA server. However, it's pretty much a wash on the client, for different reasons.

Tom: What client-side memory registration strategy do you recommend if the default on the server side is fastreg? On the performance side we are limited by the minimum size of the read/write-chunk element. If the client still gives the server a 4k chunk, the performance benefit (fewer PDUs on the wire) goes away.

Tom

>> When posting the WR, we check the fastreg capabilities bit + transport type bit: if fastreg is true -- post FastReg; if iWARP (or with a cap-bit read-with-inv flag) post rdma read w/ invalidate ... For iWARP's case, this means rdma-read-w-inv, plus rdma-send-w-inv, etc...
>
> Maybe I'm confused, but I don't understand this. iWARP RDMA Read requests don't support remote invalidate. At least, the table in RFC 5040 (p. 22) doesn't:
>
> -------+------------+--------+-------+--------+------------+--------------
> RDMA   | Message    | Tagged | STag  | Queue  | Invalidate | Message
> Message| Type       | Flag   | and   | Number | STag       | Length
> OpCode |            |        | TO    |        |            | Communicated
>        |            |        |       |        |            | between DDP
>        |            |        |       |        |            | and RDMAP
> -------+------------+--------+-------+--------+------------+--------------
> 0000b  | RDMA Write |   1    | Valid |  N/A   |    N/A     | Yes
> -------+------------+--------+-------+--------+------------+--------------
> 0001b  | RDMA Read  |   0    | N/A   |   1    |    N/A     | Yes
>        | Request    |        |       |        |            |
> -------+------------+--------+-------+--------+------------+--------------
> 0010b  | RDMA Read  |   1    | Valid |  N/A   |    N/A     | Yes
>        | Response   |        |       |        |            |
> -------+------------+--------+-------+--------+------------+--------------
> 0011b  | Send       |   0    | N/A   |   0    |    N/A     | Yes
> -------+------------+--------+-------+--------+------------+--------------
> 0100b  | Send with  |   0    | N/A   |   0    |   Valid    | Yes
>        | Invalidate |        |       |        |            |
> -------+------------+--------+-------+--------+------------+--------------
> 0101b  | Send with  |   0    | N/A   |   0    |    N/A     | Yes
>        | SE         |        |       |        |            |
> -------+------------+--------+-------+--------+------------+--------------
> 0110b  | Send with  |   0    | N/A   |   0    |   Valid    | Yes
>        | SE and     |        |       |        |            |
>        | Invalidate |        |       |        |            |
> -------+------------+--------+-------+--------+------------+--------------
> 0111b  | Terminate  |   0    | N/A   |   2    |    N/A     | Yes
> -------+------------+--------+-------+--------+------------+--------------
> 1000b  |            |
> to     | Reserved   | Not Specified
> 1111b  |            |
> -------+------------+---------------
>
> I want to take this opportunity to also mention that the RPC/RDMA client-server exchange does not support remote-invalidate currently. Because of the multiple stags supported by the rpcrdma chunking header, and because the client needs to verify that the stags were in fact invalidated, there is significant overhead, and the jury is out on that benefit. In fact, I suspect it's a loss at the client.
>
> Tom (Talpey).
[ofa-general] Re: [PATCH] AMSO1100: Add check for NULL reply_msg in c2_intr
On Fri, 2008-04-04 at 12:35 -0700, Roland Dreier wrote:
>> I'm up to my eyeballs right now. If it's ok with you, I'd say defer the refactoring.
>
> No problem, I'll queue this up, and if you ever get time to work on amso1100 you can send the refactoring. But are you working on a pmtu fix?

Steve and I will noodle on what to do here and post something.

> - R.
[ofa-general] [PATCH] AMSO1100: Add check for NULL reply_msg in c2_intr
AMSO1100: Add check for NULL reply_msg in c2_intr

This is a checker-found bug posted to bugzilla.kernel.org (7478). Upon inspection I also found a place where we could attempt to kmem_cache_free a NULL pointer.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---

Roland,

I don't think anyone has ever hit this bug, so it is a low priority in my view. I also noticed that if we refactored vq_wait_for_reply we could combine a common

	if (!reply) {
		err = -ENOMEM;
		goto bail;
	}

construct by guaranteeing that reply is non-NULL if vq_wait_for_reply returns without an error. This patch, however, is much smaller. What do you think?

 drivers/infiniband/hw/amso1100/c2_cq.c   |    4 ++--
 drivers/infiniband/hw/amso1100/c2_intr.c |    6 +-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c
index d2b3366..bb17cce 100644
--- a/drivers/infiniband/hw/amso1100/c2_cq.c
+++ b/drivers/infiniband/hw/amso1100/c2_cq.c
@@ -422,8 +422,8 @@ void c2_free_cq(struct c2_dev *c2dev, struct c2_cq *cq)
 		goto bail1;
 	reply = (struct c2wr_cq_destroy_rep *) (unsigned long) (vq_req->reply_msg);
-
-	vq_repbuf_free(c2dev, reply);
+	if (reply)
+		vq_repbuf_free(c2dev, reply);
 bail1:
 	vq_req_free(c2dev, vq_req);
 bail0:
diff --git a/drivers/infiniband/hw/amso1100/c2_intr.c b/drivers/infiniband/hw/amso1100/c2_intr.c
index 0d0bc33..3b50954 100644
--- a/drivers/infiniband/hw/amso1100/c2_intr.c
+++ b/drivers/infiniband/hw/amso1100/c2_intr.c
@@ -174,7 +174,11 @@ static void handle_vq(struct c2_dev *c2dev, u32 mq_index)
 		return;
 	}
 
-	err = c2_errno(reply_msg);
+	if (reply_msg)
+		err = c2_errno(reply_msg);
+	else
+		err = -ENOMEM;
+
 	if (!err)
 		switch (req->event) {
 		case IW_CM_EVENT_ESTABLISHED:
 			c2_set_qp_state(req->qp,
[ofa-general] Re: [PATCH] AMSO1100: Add check for NULL reply_msg in c2_intr
On Fri, 2008-04-04 at 12:22 -0700, Roland Dreier wrote:
>> I don't think anyone has ever hit this bug, so it is a low priority in my view. I also noticed that if we refactored vq_wait_for_reply we could combine a common if (!reply) { err = -ENOMEM; goto bail; } construct by guaranteeing that reply is non-NULL if vq_wait_for_reply returns without an error. This patch, however, is much smaller. What do you think?
>
> Well, now is a good time to merge either version of the fix. It would be nice to kill off one of the Coverity issues, so I'm happy to take this. It's up to you how much effort you want to spend on this... the refactoring sounds nice, but I think we're OK without it.

I'm up to my eyeballs right now. If it's ok with you, I'd say defer the refactoring.

> - R.
Re: [nfs-rdma-devel] [ofa-general] Status of NFS-RDMA ? (fwd)
On Fri, 2008-02-29 at 09:29 +0100, Sebastian Schmitzdorff wrote:
> hi pawel,
>
> I was wondering if you have achieved better nfs rdma benchmark results by now?

Pawel: What is your network hardware setup?

Thanks,
Tom

> regards
> Sebastian
>
> Pawel Dziekonski schrieb:
>> hi,
>>
>> the saga continues. ;) very basic benchmarks and surprising (at least for me) results - it looks like reading is much slower than writing, and NFS/RDMA is twice as slow in reading as classic NFS. :o results below - comments appreciated!
>>
>> regards, Pawel
>>
>> Both nfs server and client have 8 cores, 16 GB RAM, and Mellanox DDR HCAs (MT25204) connected port-to-port (no switch).
>>
>>   local_hdd  - 2 sata2 disks in soft-raid0
>>   nfs_ipoeth - classic nfs over ethernet
>>   nfs_ipoib  - classic nfs over IPoIB
>>   nfs_rdma   - NFS/RDMA
>>
>> Simple write of a 36GB file with dd (both machines have 16GB RAM):
>>
>>   /usr/bin/time -p dd if=/dev/zero of=/mnt/qqq bs=1M count=36000
>>
>>   local_hdd    sys 54.52  user 0.04  real 254.59
>>   nfs_ipoib    sys 36.35  user 0.00  real 266.63
>>   nfs_rdma     sys 39.03  user 0.02  real 323.77
>>   nfs_ipoeth   sys 34.21  user 0.01  real 375.24
>>
>> Remount /mnt to clear the cache, then read the file back from the nfs share and write it to local disk:
>>
>>   /usr/bin/time -p dd if=/mnt/qqq of=/scratch/qqq bs=1M
>>
>>   nfs_ipoib    sys 59.04  user 0.02  real 571.57
>>   nfs_ipoeth   sys 58.92  user 0.02  real 606.61
>>   nfs_rdma     sys 62.57  user 0.03  real 1296.36
>>
>> Results from bonnie++ (the Per-Chr columns were empty and are omitted):
>>
>> Version 1.03c         --Sequential Write--  -Rewrite-  --Sequential Read--  --Random-
>>                           --Block--                        --Block--        --Seeks--
>> Machine    Size          K/sec  %CP        K/sec %CP      K/sec  %CP         /sec %CP
>> local_hdd  35G:128k      93353   12        58329   6     143293    7        243.6   1
>> local_hdd  35G:256k      92283   11        58189   6     144202    8        172.2   2
>> local_hdd  35G:512k      93879   12        57715   6     144167    8        128.2   4
>> local_hdd  35G:1024k     93075   12        58637   6     144172    8         95.3   7
>> nfs_ipoeth 35G:128k      91325    7        31848   4      64299    4        170.2   1
>> nfs_ipoeth 35G:256k      90668    7        32036   5      64542    4        163.2   2
>> nfs_ipoeth 35G:512k      93348    7        31757   5      64454    4         85.7   3
>> nfs_ipoeth 35G:1024k     91283    7        31869   5      64241    5         51.7   4
>> nfs_ipoib  35G:128k      91733    7        36641   5      65839    4        178.4   2
>> nfs_ipoib  35G:256k      92453    7        36567   6      66682    4        166.9   3
>> nfs_ipoib  35G:512k      91157    7        37660   6      66318    4         86.8   3
>> nfs_ipoib  35G:1024k     92111    7        35786   6      66277    5         53.3   4
>> nfs_rdma   35G:128k      91152    8        29942   5      32147    2        187.0   1
>> nfs_rdma   35G:256k      89772    7        30560   5      34587    2        158.4   3
>> nfs_rdma   35G:512k      91290    7        29698   5      34277    2         60.9   2
>> nfs_rdma   35G:1024k     91336    8        29052   5      31742    2         41.5   3
>>
>>                     --Sequential Create--          --Random Create--
>>                 -Create-- --Read---  -Delete-  -Create-- --Read---  -Delete-
>> files:max:min    /sec %CP  /sec %CP  /sec %CP   /sec %CP  /sec %CP  /sec %CP
>> local_hdd  16   10587  36 +++++ +++  8674  29  10727  35 +++++ +++  7015  28
>> local_hdd  16   11372  41 +++++ +++  8490  29  11192  43 +++++ +++  6881  27
>> local_hdd  16   10789  35 +++++ +++  8520  29  11468  46 +++++ +++  6651  24
>> local_hdd  16   10841  40 +++++ +++  8443  28  11162  41 +++++ +++  6441  22
>> nfs_ipoeth 16    3753   7 13390  12  3795   7   3773   8 22181  16  3635   7
>> nfs_ipoeth 16    3762   8 12358   7  3713   8   3753   7 20448  13  3632   6
>> nfs_ipoeth 16    3834   7 12697   6  3729   8   3725   9 22807  11  3673   7
>> nfs_ipoeth 16    3729   8 14260  10  3774   7   3744   7 25285  14  3688   7
>> nfs_ipoib  16    6803  17 +++++ +++  6843  15   6820  14 +++++ +++  5834  11
>> nfs_ipoib  16    6587  16 +++++ +++  4959   9   6832  14 +++++ +++  5608  12
>> nfs_ipoib  16    6820  18 +++++ +++  6636  15   6479  15 +++++ +++  5679  13
>> nfs_ipoib  16    6475  14 +++++ +++  6435  14   5543  11 +++++ +++  5431  11
>> nfs_rdma   16    7014  15 +++++ +++  6714  10   7001  14 +++++ +++  5683   8
>> nfs_rdma   16    7038  13 +++++ +++  6713  12   6956  11 +++++ +++  5488   8
>> nfs_rdma   16    7058  12 +++++ +++  6797  11   6989  14 +++++ +++  5761   9
>> nfs_rdma   16    7201  13 +++++ +++  6821  12   7072  15 +++++ +++  5609   9
Re: [ofa-general] post_recv question
On 2/22/08 12:09 AM, Roland Dreier [EMAIL PROTECTED] wrote: I think we can assume that the ringing of the doorbell is synchronous, i.e. when the processor completes its write, the card knows there are RQ WQEs available in host memory. It doesn't affect your larger point, but to be pedantically precise, writes across PCI will be posted, so the CPU may fully retire a write to MMIO long before that write completes at its final destination. You're right. In fact, I think up to 4 words for the common implementation. But I think this speaks again to the claim that guarantees between adapters on different busses can't work, because posted writes go to different FIFOs. - R. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] post_recv question
On Thu, 2008-02-21 at 12:22 -0800, Roland Dreier wrote: OpenMPI can be configured to send credit updates over a different QP. I'll try to stress it next week to see what happens. It seems that it would be pretty hard to hit this race in practice. And I don't think mem-free Mellanox hardware has any race -- not positive about Tavor/non-mem-free Arbel. (On IB you need to set RNR retries to 0 also for the missing receive to be detectable even if the race exists) Well, consider the case of two adapters on two different PCI busses. One is busy, one is not. Specifically, the post_recv QP is on an HCA on a busy bus, and the post_send (of the credit) is on a QP on an HCA on a dedicated bus. I think we can assume that the ringing of the doorbell is synchronous, i.e. when the processor completes its write, the card knows there are RQ WQEs available in host memory, but whether and when the WQE is fetched relative to the processor is asynchronous. The card will have to get on the bus again and read host memory. Meanwhile the processor runs off and posts a send of the credit on the other QP, on a different HCA. The peer responds with a send to the data QP. The receiving adapter knows the WQE is there, but it may not have fetched it yet. The crux of the question is whether the adapter MUST fetch the WQE and place the packet, or can simply drop it. If you say it MUST, then you must have enough buffer to handle worst-case delayed placement. If the post guarantee is only within the same QP or affiliated QP (SRQ), then all it must do is ensure that when processing an SQ request AND the associated RQ (SRQ) is empty, it fetches outstanding, unread RQ WQEs prior to processing the SQ WQE. This allows for the post_recv guarantees without the HCA buffering requirements. I seem to recall that the specs say something about ordering and synchronization between unaffiliated QPs and/or between adapters, but the specific reference long ago fell off my LRU list.
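The hazard described above can be modeled abstractly: the doorbell write retires (the receive counts as "posted") before the adapter has actually fetched the WQE, and a credit advertised over a different path can race ahead of that fetch. A toy single-threaded C model, not verbs code and with purely hypothetical names, sketching the behavior Tom argues for (re-check the RQ before declaring "no buffer"):

```c
#include <stdbool.h>

/* Toy model of the race discussed above (all names hypothetical).
 * posted:  RQ WQEs the host has posted (doorbell rung).
 * fetched: RQ WQEs the adapter has actually read from host memory. */
struct rq_model {
    int posted;
    int fetched;
};

/* Host posts a receive: the doorbell write retires immediately,
 * but the adapter has not necessarily fetched the WQE yet. */
static void post_recv(struct rq_model *rq)
{
    rq->posted++;
}

/* A packet arrives.  A compliant adapter must re-check host memory
 * for posted-but-unfetched WQEs before it declares "no buffer
 * available"; only if nothing was posted is this a genuine RNR. */
static bool place_packet(struct rq_model *rq)
{
    if (rq->fetched == 0 && rq->posted > rq->fetched)
        rq->fetched = rq->posted;   /* late fetch of pending WQEs */
    if (rq->fetched == 0)
        return false;               /* genuine RNR: nothing posted */
    rq->fetched--;
    rq->posted--;
    return true;
}
```

In this model, posting the receive and then sending the credit over another QP is safe even if the peer's send arrives before the adapter's WQE fetch, because the adapter re-reads the RQ before failing; an adapter that skipped that re-check would drop the packet exactly as the scenario above describes.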
Tom
Re: [ofa-general] post_recv question
On Thu, 2008-02-21 at 15:48 -0800, Caitlin Bestler wrote: Good example, more detailed comments in-line. On Thu, Feb 21, 2008 at 2:47 PM, Tom Tucker [EMAIL PROTECTED] wrote: On Thu, 2008-02-21 at 12:22 -0800, Roland Dreier wrote: OpenMPI can be configured to send credit updates over a different QP. I'll try to stress it next week to see what happens. It seems that it would be pretty hard to hit this race in practice. And I don't think mem-free Mellanox hardware has any race -- not positive about Tavor/non-mem-free Arbel. (On IB you need to set RNR retries to 0 also for the missing receive to be detectable even if the race exists) Well, consider the case of two adapters on two different PCI busses. One is busy, one is not. Specifically, the post_recv QP is on an HCA on a busy bus, and the post_send (of the credit) is on a QP on an HCA on a dedicated bus. I think we can assume that the ringing of the doorbell is synchronous, i.e. when the processor completes its write, the card knows there are RQ WQEs available in host memory, but whether and when the WQE is fetched relative to the processor is asynchronous. The card will have to get on the bus again and read host memory. Meanwhile the processor runs off and posts a send of the credit on the other QP, on a different HCA. The peer responds with a send to the data QP. The receiving adapter knows the WQE is there, but it may not have fetched it yet. The crux of the question is whether the adapter MUST fetch the WQE and place the packet, or can simply drop it. If you say it MUST, then you must have enough buffer to handle worst-case delayed placement. If the post guarantee is only within the same QP or affiliated QP (SRQ), then all it must do is ensure that when processing an SQ request AND the associated RQ (SRQ) is empty, it fetches outstanding, unread RQ WQEs prior to processing the SQ WQE. This allows for the post_recv guarantees without the HCA buffering requirements. I disagree.
What is required is the adapter MUST NOT take an action based on a buffer not available diagnosis until it is certain that it has considered all WQEs that have been successfully posted by the consumer. Ok. So what does the HW do with the packet while it's pondering its options? It has to put it somewhere. That's my point. You either guarantee that any advertisement of availability can't be issued prior to the buffer being available, or the buffer is synchronously available prior to the advertisement of the credit. Snooping the [s]RQ while processing the SQ is a way of delaying the issuance of a credit before the buffer (spec'd in the WQE) is actually known to the adapter. But this only works in the context of a single HCA. Further, it MUST NOT require a further action by the consumer to guarantee that it notices a posted WQE. Agreed. Particularly in iWARP, the application layer is free to implement Send/Recv credits by *any* mechanism desired (the only requirement is that there is one; you might recall that there were extensive discussions on this point regarding unsolicited messages for iSER). The concept that the application MUST provide SOME form of flow control was accepted only grudgingly. So clearly any more specific mechanisms were not the intent of the drafters. Yes, but I'm not sure there's any confusion there -- I think this discussion is about how credits can be issued. In particular, what does it mean to issue a credit for: - this QP, - another QP on the same HCA, - another QP on a different HCA. So far, it seems the consensus is that all of the above should work. I'm just not convinced the current implementations guarantee this. So if there are still 1000 Recv WQEs in the SRQ, we can allow the adapter a great deal of flexibility in when the 1001st is linked into the data structures. The only real constraint is that it MUST do 1001 successful allocations *before* it triggers any sort of buffer not available error. Agreed.
I'm not recalling the specific language immediately, but I do recall concluding that sub-dividing the SRQ on an RSS-like basis was *not* compliant with the RDMAC specs, and that the left half of the adapter could not declare buffer not found while the right half of the adapter still had a free buffer. Agreed. This is of course a major pain if you are trying to team two RDMA adapters to form a single virtual adapter, or even two largely independent ports on the same physical adapter. But the intent of the specifications is very clear: if the consumer has posted 1000 recv WQEs and gotten SUCCESS for each of them, then the adapter MUST allocate all 1000 recv WQEs *before* it can fail an operation because no buffer was available. Agreed. So there is a difference between must be pushed to the adapter now and must be pushed to the adapter before it is too late. Yes. Tom
Re: [ofa-general] iommu dma mapping alignment requirements
On Thu, 2007-12-20 at 11:14 -0600, Steve Wise wrote: Hey Roland (and any iommu/ppc/dma experts out there): I'm debugging a data corruption issue that happens on PPC64 systems running rdma on kernels where the iommu page size is 4KB yet the host page size is 64KB. This feature was added to the PPC64 code recently, and is in kernel.org from 2.6.23. So if the kernel is built with a 4KB page size, no problems. If the kernel is prior to 2.6.23, then 64KB page configs work too. It's only a problem when the iommu page size != host page size. It appears that my problem boils down to a single host page of memory that is mapped for dma, where the dma address returned by dma_map_sg() is _not_ 64KB aligned. Here is an example: app registers va 0x2d9a3000 len 12288 ib_umem_get() creates and maps a umem and chunk that look like (dumping state from a registered user memory region): umem len 12288 off 12288 pgsz 65536 shift 16 chunk 0: nmap 1 nents 1 sglist[0] page 0xc0930b08 off 0 len 65536 dma_addr 5bff4000 dma_len 65536 So the kernel maps 1 full page for this MR. But note that the dma address is 5bff4000, which is 4KB aligned, not 64KB aligned. I think this is causing grief to the RDMA HW. My first question is: is there an assumption or requirement in Linux that dma addresses should have the same alignment as the host address they are mapped to? I.e. the rdma core is mapping the entire 64KB page, but the mapping doesn't begin on a 64KB page boundary. If this mapping is considered valid, then perhaps the rdma hw is at fault here. But I'm wondering if this is a PPC/iommu bug. BTW: here is what the Memory Region looks like to the HW: TPT entry: stag idx 0x2e800 key 0xff state VAL type NSMR pdid 0x2 perms RW rem_inv_dis 0 addr_type VATO bind_enable 1 pg_size 65536 qpid 0x0 pbl_addr 0x003c67c0 len 12288 va 2d9a3000 bind_cnt 0 PBL: 5bff4000 Any thoughts? The Ammasso certainly works this way.
If you tell it the page size is 64KB, it will ignore bits in the page address that encode 0-65535. Steve.
Re: [ofa-general] OFA server patching
On Thu, 2007-11-29 at 09:07 -0600, Steve Wise wrote: Jeff Becker wrote: Hi all. In the interest of keeping our server up to date, I applied the latest Ubuntu patches. Several upgrades were made, including git. If you have any problems, let me know. Thanks. -jeff Git seems broken for me. I can no longer use the build_ofa_kernel.sh script. I get this sort of error: fatal: corrupted pack file /home/vlad/scm/ofed_1_2/.git/objects/pack/pack-914d440d906ffa47a30611df81c0597e896040fa.pack Failed executing git I think the version of git you're using is old and doesn't recognize some of the object types in the repository. I saw this same thing when I tried to use a git tree that had remotes created with a newer version of git.
[ofa-general] NFS-RDMA for OFED 1.3
Jeff: There's an updated version of the server transport switch and rdma transport provider available here: git://linux-nfs.org/~tomtucker/nfs-rdma-dev-2.6.git Tom
RE: [ofa-general] Re: iWARP peer-to-peer CM proposal
On Wed, 2007-11-28 at 11:43 -0500, Caitlin Bestler wrote: -Original Message- From: Steve Wise [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 27, 2007 4:48 PM To: Caitlin Bestler Cc: Kanevsky, Arkady; Glenn Grundstrom; Leonid Grossman; openib- [EMAIL PROTECTED] Subject: Re: [ofa-general] Re: iWARP peer-to-peer CM proposal Caitlin Bestler wrote: On Nov 27, 2007 3:58 PM, Steve Wise [EMAIL PROTECTED] wrote: For the short term, I claim we just implement this as part of linux iwarp connection setup (mandating a 0B read be sent from the active side). Your proposal to add meta-data to the private data requires a standards change anyway and is, IMO, the 2nd phase of this whole enchilada... Steve. I don't see how you can have any solution here that does not require meta-data. For non-peer-to-peer connections neither a zero length RDMA Read or Write should be sent. An extraneous RDMA Read is particularly onerous for a short lived connection that fits the classic active/passive model. So *something* is telling the CMA layer that this connection may need an MPA unjam action. If that isn't meta-data, what is it? I assumed the 0B read would _always_ be sent as part of establishing an iWARP connection using linux and the rdma-cm. That is an extra round-trip per connection setup, which is a significant penalty for a short lived connection. It is trivial for HPC/peer-to-peer applications, but would be a killer for something like HTTP over RDMA. I find it hard to get excited about optimizing short lived connections for RDMA. I simply don't think it's an interesting use case. And btw, HTTP long ago got rid of short lived connections because it's painful even on TCP. Doing something like this for *every* connection makes it effectively a change to the MPA protocol. Uh. No, it doesn't. Normalizing the behavior of applications during connection setup doesn't change the underlying protocol. It adds another one on top. OFA is not the forum for such discussions, the IETF is. 
My living room, the dinner table, the local bar and this mailing list are perfectly acceptable forums for discussing a protocol. The IETF is the forum for standardizing one. Right now, I don't think we're ready to standardize, because we're still exploring the options; the first of which is NOT changing MPA. This group has the unique benefit of actually USING and IMPLEMENTING the protocol and therefore has some beneficial insights that may and should be shared. All that said, revving the MPA protocol is way down the road. OFA drafting an understanding of how peer-to-peer applications use the existing protocol, on the other hand, is quite reasonable. That's step 1, and the 0B READ is one way to do it. But it has to be something done by peer-to-peer middleware or by the verbs layer in response to a flag from the peer-to-peer middleware. Otherwise it is not augmenting a protocol, it is changing it. The flag may be useful; however, I don't see the connection between the flag and complying with the MPA protocol.
Re: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU
Felix Marti wrote: -Original Message- From: Tom Tucker [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 23, 2007 9:32 PM To: Felix Marti Cc: Kanevsky, Arkady; Glenn Grundstrom; Sean Hefty; Steve Wise; Roland Dreier; [EMAIL PROTECTED]; OpenFabrics General Subject: Re: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU Felix Marti wrote: -Original Message- From: [EMAIL PROTECTED] [mailto:general- [EMAIL PROTECTED] On Behalf Of Kanevsky, Arkady Sent: Tuesday, October 23, 2007 6:26 PM To: Glenn Grundstrom; Sean Hefty; Steve Wise Cc: Roland Dreier; [EMAIL PROTECTED]; OpenFabrics General Subject: RE: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU This is still a protocol and should be defined by IETF not OFA. But if we get agreement from all iWARP vendors this will be a good step. [felix] This will not work with a Chelsio RNIC which follows the IETF specification by a) not issuing a 0B RDMA Write to the wire and b) silently consuming an incoming 0B write. Therefore 0B RDMA Writes cannot be 'abused' for such a synchronization mechanism. I believe that the mentioned apps adhering to the iWarp requirement do a 'send' from the active side and only have the passive side issue RDMA ops once the incoming send has been received. I would guess that following a similar model is the best way to go and supported by all iWarp vendors implementing the IETF spec. IMO, the iWARP vendors _must_ get together and work on MPA '2'. Standardizing FPDU 'abuse' might be a good place to start, but it needs to be fixed to support peer-to-peer going forward. In the mean-time, imperfectly hiding the issue in the Firmware, uDAPL, the iWARP CM or anywhere else except the application seems to me to be the only customer friendly solution. 
[felix] While I'm not against trying to hide the connection migration details somewhere below the ULP, I'm not convinced that the issue is as severe as you make it to be, and I would not press to have the issue resolved in a manner that requires a new MPA version. In fact, the different rdma transports (and maybe even different versions of the same transport (in the case of IB)) provide different features, and I would assume that ULPs will eventually code to these features and must thus be aware of the underlying transport protocol. In that bigger picture, the connection migration issue at hand seems fairly trivial to solve even if it requires an ULP change... I didn't make an argument about severity. Qualifying the severity is in the customer's purview. I'm simply pointing out the following: a) the perspective that the restriction is trivial is how we got here, b) making the app change is putting a decision in the customer's hands that IMO an iWARP vendor would rather they didn't have to make ('Do I or don't I support iWARP?'), and c) you have the power to hide this behavior for most cases. Finally, I believe RFC means Request for Comment. Well, here's one last comment -- Add an FPDU message at the end of the MPA exchange and fix the problem in the protocol. If we cannot get agreement on it on the reflector, let's do it at the SC'07 OFA dev. conference. Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -Original Message- From: Glenn Grundstrom [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 23, 2007 9:02 PM To: Sean Hefty; Steve Wise Cc: Roland Dreier; [EMAIL PROTECTED]; OpenFabrics General Subject: RE: [ofa-general] [RFP] support for iWARP requirement - activeconnect side MUST send first FPDU That is what I've been trying to push. Both MVAPICH2 and OMPI have been open to adjusting their transports to adhere to this requirement.
I wouldn't mind implementing something to enforce this in the IWCM or the iWARP drivers IF there was a clean way to do it. So far there hasn't been a clean way proposed. Why can't either uDAPL or iW CM always do a send from the active to passive side that gets stripped off? From the active side, the first send is always posted before any user sends, and if necessary, a user send can be queued by software to avoid a QP/CQ overrun. The completion can simply be eaten by software. On the passive side, you have a similar process for receiving the data. This is similar to an option in the NetEffect driver. A zero byte RDMA write is sent from the active side and accounted for on the passive side. This can
Re: [ofa-general] [RFP] support for iWARP requirement - active connect side MUST send first FPDU
Michael Krause wrote: At 01:17 PM 10/23/2007, Steve Wise wrote: Sean Hefty wrote: There has been much discussion on a private thread regarding bug #735 - dapltest performance tests don't adhere to iWARP standard that needs to move to the general list. This bug would be better titled iWarp cannot support uDAPL API. :) Seriously, the iWarp and uDAPL specs conflict. One needs to change. Can someone come up with a solution, possibly in iWARP CM, that will work and ensure interoperability between iWARP devices? I thought the restriction was there to support switching between streaming and rdma mode. If a connection only uses rdma mode, is the restriction really needed at all? Yes, because all iWARP connections start out as TCP streaming mode connections, and the MPA startup messages are sent in streaming mode. Then the connection is transitioned into FPDU (Framed PDU) mode using the MPA protocol. Correct. The IETF was very clear on these requirements (significant debate occurred over at least 12-18 months) and there is unlikely to be any traction in changing the iWARP specifications to provide another mechanism. Best to provide APIs that detect which semantics are required; if the application cannot adjust, it cannot use the iWARP semantics. First let me apologize in advance, but that is simply not a workable solution for the customer. I'm not taking anything away from the efforts of those involved with the definition of the MPA protocol; however, that protracted debate unfortunately occurred 2-3 years in advance of a deployed solution. The duration of the debate doesn't overcome the absence of practical perspective. There are now multiple implementations, the customers of which are complaining about the cost of the compromises made. We now have the benefit of hindsight and, in my opinion, should rev the MPA protocol. After all, that's why the number is there in the header -- right?
It may be that those involved with the original debate have no interest in revisiting it, but IMO that is irrelevant. There are now new companies involved that implemented RDDP, have customers using it, and have a sustaining (both interpretations intended) interest in making RDDP better. I, for one, would encourage them to do so. Protocols are not immutable, unless they're dead. BTW, if one uses the SDP port mapper protocol (see the IETF SDP specification), one can detect from the start that RDMA is being used and one could start in RDMA mode sans the MPA requirement. The SDP port mapper protocol also enables one to apply various other policies such as determining whether the application / remote node session should be allowed to run over RDMA or not - simple point of control for management. Really? What about CRC, Markers and Private Data? Mike
Re: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU
Felix Marti wrote: -Original Message- From: [EMAIL PROTECTED] [mailto:general- [EMAIL PROTECTED] On Behalf Of Kanevsky, Arkady Sent: Tuesday, October 23, 2007 6:26 PM To: Glenn Grundstrom; Sean Hefty; Steve Wise Cc: Roland Dreier; [EMAIL PROTECTED]; OpenFabrics General Subject: RE: [ofa-general] [RFP] support for iWARP requirement - activeconnectside MUST send first FPDU This is still a protocol and should be defined by IETF not OFA. But if we get agreement from all iWARP vendors this will be a good step. [felix] This will not work with a Chelsio RNIC which follows the IETF specification by a) not issuing a 0B RDMA Write to the wire and b) silently consuming an incoming 0B write. Therefore 0B RDMA Writes cannot be 'abused' for such a synchronization mechanism. I believe that the mentioned apps adhering to the iWarp requirement do a 'send' from the active side and only have the passive side issue RDMA ops once the incoming send has been received. I would guess that following a similar model is the best way to go and supported by all iWarp vendors implementing the IETF spec. IMO, the iWARP vendors _must_ get together and work on MPA '2'. Standardizing FPDU 'abuse' might be a good place to start, but it needs to be fixed to support peer-to-peer going forward. In the mean-time, imperfectly hiding the issue in the Firmware, uDAPL, the iWARP CM or anywhere else except the application seems to me to be the only customer friendly solution. If we can not get agreement on it on reflector lets do it at SC'07 OFA dev. conference. Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. 
- Suite 16.Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -Original Message- From: Glenn Grundstrom [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 23, 2007 9:02 PM To: Sean Hefty; Steve Wise Cc: Roland Dreier; [EMAIL PROTECTED]; OpenFabrics General Subject: RE: [ofa-general] [RFP] support for iWARP requirement - activeconnect side MUST send first FPDU That is what I've been trying to push. Both MVAPICH2 and OMPI have been open to adjusting their transports to adhere to this requirement. I wouldn't mind implementing something to enforce this in the IWCM or the iWARP drivers IF there was a clean way to do it. So far there hasn't been a clean way proposed. Why can't either uDAPL or iW CM always do a send from the active to passive side that gets stripped off? From the active side, the first send is always posted before any user sends, and if necessary, a user send can be queued by software to avoid a QP/CQ overrun. The completion can simply be eaten by software. On the passive side, you have a similar process for receiving the data. This is similar to an option in the NetEffect driver. A zero byte RDMA write is sent from the active side and accounted for on the passive side. This can be turned on and off by compile and module options for compatibility. I second Sean's question - why can't uDAPL or the iw_cm do this? (Yes this adds wire protocol, which requires both sides to support it.) 
- Sean
Re: [ofa-general] iSER data corruption issues
On Thu, 2007-10-04 at 12:14 -0400, Pete Wyckoff wrote: [EMAIL PROTECTED] wrote on Wed, 03 Oct 2007 15:01 -0700: Machines are opteron, fedora 7 up-to-date with its openfab libs, kernel 2.6.23-rc6 on target. Either 2.6.23-rc6 or 2.6.22 or 2.6.18-rhel5 on initiator. For some reason, it is much easier to produce with the rhel5 kernel. There was a bug in mthca that caused data corruption with FMRs on Sinai (1-port PCIe) HCAs. It was fixed in commit 608d8268 (IB/mthca: Fix data corruption after FMR unmap on Sinai) which went in shortly before 2.6.21 was released. I don't know if the RHEL5 2.6.18 kernel has this fix or not -- but if you still see the problem on 2.6.22 and later kernels then this isn't the fix anyway. This is definitely it. Same test setup runs for an hour with this patch, but fails in tens of seconds without it. Thanks for pointing it out. This rhel5 kernel is 2.6.18-8.1.6. Perhaps there are newer ones about that have this critical patch included. I'm going to add a Big Fat Warning on the iser distribution about pre-2.6.21 kernels. It also crashes if the iSER connection drops in a certain easy-to-reproduce way, another reason to avoid it. Regarding the larger test I talked about that fails even on modern kernels, I'm still not able to reproduce that on my setup. I ran it literally all night with a hacked target that calculated the return buffer rather than accessing the disk. For now I'm calling that a separate bug and will investigate it further. Thanks to Tom and Tom for helping debug this. Thanks to Roland who actually knew what it was ... ;-) -- Pete
Re: [ofa-general] iSER data corruption issues
On Wed, 2007-10-03 at 13:42 -0400, Pete Wyckoff wrote: How does the requester (in IB speak) know that an RDMA Write operation has completed on the responder? We have a software iSER target, available at git.osc.edu/tgt or browse at http://git.osc.edu/?p=tgt.git . Using the existing in-kernel iSER initiator code, very rarely data corruption occurs, in that the received data from SCSI read operations does not match what was expected. Sometimes it appears as if random kernel memory has been scribbled on by an errant RDMA write from the target. My current working theory is that the RDMA write has not completed by the time the initiator looks at its incoming data buffer. Single RC QP, single CQ, no SRQ. Only Send, Receive, and RDMA Write work requests are used. After everything is connected up, a SCSI read sequence looks like: initiator: register pages with FMR, write test pattern initiator: Send request to target target: Recv request target: RDMA Write response to initiator target: Wait for CQ entry for local RDMA Write completion Pete: I don't think this should be necessary... target: Send response to initiator ...as long as the send is posted on the same SQ as the write. initiator: Recv response, access buffer On very rare occasions, this buffer will have the test pattern, not the data that the target just sent. Machines are opteron, fedora 7 up-to-date with its openfab libs, kernel 2.6.23-rc6 on target. Either 2.6.23-rc6 or 2.6.22 or 2.6.18-rhel5 on initiator. For some reason, it is much easier to produce with the rhel5 kernel. One site with fast disks can see similar corruption with 2.6.23-rc6, however. Target is pure userspace. Initiator is in kernel and is poked by lmdd (like normal dd) through an iSCSI block device (/dev/sdb). The IB spec seems to indicate that the contents of the RDMA Write buffer should be stable after completion of a subsequent send message (o9-20). In fact, the Wait for CQ entry step on the target should be unnecessary, no? I think so too.
Could there be some caching issues that the initiator is missing? I've added print[fk]s to the initiator and target to verify that the sequence of events is truly as above, and that the virtual addresses are as expected on both sides. Any suggestions or advice would help. Thanks, If your theory is correct, the data should eventually show up. Does it? Does your code check for errors on dma_map_single/page? -- Pete P.S. Here are some debugging printfs not in the git. Userspace code does 200 read()s of length 8000, but complains about the result somewhere in the 14th read, from 112000 to 120000, and exits early. Expected pattern is a series of 400,000 4-byte words, incrementing by 4, starting from 0. So 0x00000000, 0x00000004, ..., 0x001869fc: % lmdd of=internal ipat=1 if=/dev/sdb bs=8000 count=200 mismatch=10 off=112000 want=1c000 got=3b3b3b3b Initiator generates a series of SCSI operations, as driven by readahead and the block queue scheduler. You can see that it starts reading 4 pages, then 1 page, then 23 pages, then 1 page and so on, in order. These sizes and offsets vary from run to run. Each line here is printed after the SCSI read response has been received. It prints the first word in the buffer, and you can see the test pattern where data should be: tag 02 va 36061000 len 4000 word0 00000000 ref 1 tag 03 va 36065000 len 1000 word0 4000 ref 1 tag 04 va 36066000 len 17000 word0 5000 ref 1 tag 05 va 7b6bc000 len 1000 word0 3b3b3b3b ref 1 Is it interesting that the bad word occurs on the first page of the new map? tag 06 va 7b6bd000 len 1f000 word0 0001d000 ref 1 tag 07 va 7bdc2000 len 2 word0 0003c000 ref 1 The userspace target code prints a line when it starts the RDMA write, then a line when the RDMA write completes locally, then a line when it sends the response. The tags are what the initiator assigned to each request.
The target thinks it is sending a 4096-byte buffer that has 0x1c000 as its first word, but the initiator did not see it:

    tag 02 va 36061000 len 4000 word0 00000000 rdmaw
    tag 02 rdmaw completion
    tag 02 resp
    tag 03 va 36065000 len 1000 word0 4000 rdmaw
    tag 03 rdmaw completion
    tag 03 resp
    tag 04 va 36066000 len 17000 word0 5000 rdmaw
    tag 04 rdmaw completion
    tag 04 resp
    tag 05 va 7b6bc000 len 1000 word0 0001c000 rdmaw
    tag 05 rdmaw completion
    tag 05 resp
    tag 06 va 7b6bd000 len 1f000 word0 0001d000 rdmaw
    tag 06 rdmaw completion
    tag 07 va 7bdc2000 len 2 word0 0003c000 rdmaw
    tag 07 rdmaw completion
    tag 06 resp
    tag 07 resp

___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device
On Wed, 2007-09-26 at 14:06 -0500, Jim Mott wrote:

This is a two-part bug report. One is a conceptual problem that may just be a problem of understanding on my part. The other is what I believe to be a bug in the mlx4 driver.

mthca has the same issue.

1) ib_create_qp() fails with max_sge

If you use ib_query_device() to return the device-specific attribute max_sge, it seems reasonable to expect you can create a QP with max_send_sge=max_sge. The problem is that this often fails. The reason is that depending on the QP type (RC, UD, etc.) and how the QP will be used (send, RDMA, atomic, etc.), there can be extra segments required in the WQE that eat up SGE entries. So while some send WQEs might have max_sge available SGEs, many will not. Normally the difference between max_sge and the actual maximum value allowed (and checked) for max_send_sge is 1 or 2.

This issue may need API extensions to definitively resolve. In the short term, it would be very nice if the max_sge reported by ib_query_device() could always be a value that ib_create_qp() could use. Think of it as the minimum max_send_sge value that will work for all QP types.

2) mlx4 setting of max send SGEs

The recent patch to support shrinking WQEs introduces a big difference between the send SGEs mlx4 actually supports (checked against 61; should be 59 or 60) and what is reported by ib_query_device (32, to equal the receive-side max_rq_sg value). The patch that follows will allow mlx4 to support the number of send SGEs returned by ib_query_device, and in fact quite a few more. It probably breaks shrinking WQEs and thus should not be applied directly. Note that if ib_query_device() returned max_sge adjusted for the raddr and atomic segments, this fix would not be needed. mlx4 would still support more SGEs in hardware than can be used through the API, but that is a different problem.
--- ofa_1_3_dev_kernel.orig/drivers/infiniband/hw/mlx4/qp.c	2007-09-26 13:27:47.000000000 -0500
+++ ofa_1_3_dev_kernel/drivers/infiniband/hw/mlx4/qp.c	2007-09-26 13:36:40.000000000 -0500
@@ -370,7 +370,7 @@ static int set_kernel_sq_size(struct mlx
 	qp->sq.wqe_shift = ilog2(roundup_pow_of_two(s));
 
 	for (;;) {
-		if (1 << qp->sq.wqe_shift > dev->dev->caps.max_sq_desc_sz)
+		if (s > dev->dev->caps.max_sq_desc_sz)
 			return -EINVAL;
 
 		qp->sq_max_wqes_per_wr = DIV_ROUND_UP(s, 1 << qp->sq.wqe_shift);
RE: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device
FWIW, I have code in my apps that retries QP creation with reduced values when the allocation with max fails.

There was also an earlier e-mail thread on this exact same issue, but the solution bantered about was to use special values in the qp_attr structure, a la QP_MAX_SEND_SGE (-1?). The provider would recognize this value and allocate the max for that attribute that would succeed given the current resource situation. The qp_attr structure would then be updated by the provider with the values given. This approach extends, but doesn't break, the API, allows existing apps to work as usual, and avoids the retry logic that I've added to my apps.

Just a thought,
Tom

On Wed, 2007-09-26 at 20:41 -0500, Jim Mott wrote:

The same bug exists with mthca. I saw it originally in the kernel doing RDS work, but I just put together a short user-space test. ibv_query_device(MT25204) returns max_sge=30:

- ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge fails
- ibv_create_qp with qp_attr.cap.max_send_sge = dev_attr.max_sge-1 works

I only have the two types of adapters to test with.

-----Original Message-----
From: Roland Dreier [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 26, 2007 5:32 PM
To: Jim Mott
Cc: general@lists.openfabrics.org
Subject: Re: [ofa-general] [Bug report / partial patch] OFED 1.3 send max_sge lower than reported by ib_query_device

A minimal API change that could help would be to add two new fields to the ib_device_attr structure returned by ib_query_device:

- delta_sge_sg
- delta_sge_rd

Hmm, a cute idea, but I'm still left wondering if it's worth the ABI breakage etc. just to give a few more S/G entries in some situations.

The behavior would be that in all cases using max_sge for send or receive SGE count in create_qp would always succeed.

That means the current value the drivers return there would have to be reduced to fix this bug. All existing codes would continue to run.
Actually, are there any drivers other than patched mlx4 where max_sge doesn't always work? I agree we do want to get this right, but I thought we had fixed all such bugs. (And we should make sure that any shrinking-WQE patch for mlx4 doesn't introduce new bugs.)

(BTW, I see a different bug in unpatched mlx4, namely that it might report a too-big number of S/G entries allowed for the SQ.)

It looks like there is some movement in this direction already with the fields:

- max_sge_rd (nes, amso1100, ehca, cxgb3 only)

This field is obsolete, since we don't handle RD and almost certainly never will. I'm not sure why anyone is setting a value.

- max_srq_sge (amso1100, mthca, mlx4, ehca, ipath only)

Any devices that handle SRQ should set this. I think cxgb3 does not support SRQ.

 - R.
RE: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.
On Mon, 2007-09-24 at 16:30 -0500, Glenn Grundstrom wrote:

-----Original Message-----
From: Roland Dreier [mailto:[EMAIL PROTECTED]
Sent: Monday, September 24, 2007 2:33 PM
To: Glenn Grundstrom
Cc: Steve Wise; [EMAIL PROTECTED]; general@lists.openfabrics.org
Subject: Re: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.

I'm sure I had seen a previous email in this thread that suggested using a userspace library to open a socket in the shared port space. It seems that suggestion was dropped without reason. Does anyone know why?

Yes, because it doesn't handle in-kernel uses (e.g. NFS/RDMA, iSER, etc).

The kernel apps could open a Linux TCP socket and create an RDMA socket connection. Both calls are standard Linux kernel architected routines.

This approach was NAK'd by David Miller and others...

Doesn't NFSoRDMA already open a TCP socket and another for RDMA traffic (ports 2049 and 2050, if I remember correctly)?

The NFS RDMA transport driver does not open a socket for the RDMA connection. It uses a different port in order to allow both TCP and RDMA mounts to the same filer.

I currently don't know if iSER, RDS, etc. already do the same thing, but if they don't, they probably could very easily.

Woe be to those who do so...

Does the NetEffect NIC have the same issue as cxgb3 here? What are your thoughts on how to handle this?

Yes, the NetEffect RNIC will have the same issue as Chelsio, and all future RNICs which support a unified TCP address with Linux will as well. Steve has put a lot of thought and energy into the problem, but I don't think users and admins will be very happy with us in the long run.

Agreed.

In summary, short of having the rdma_cm share kernel port space, I'd like to see the equivalent in userspace and have the kernel apps handle the issue in a similar way as described above.
There are a few technical issues to work through (like passing the userspace IP address to the kernel),

This just moves the socket creation to code that is outside the purview of the kernel maintainers. The exchange of the 4-tuple created with the kernel module, however, is back in the kernel and in the maintainers' control and responsibility. In my view, anything like this will be viewed as an attempt to sneak code into the kernel that the maintainer has already vehemently rejected. This will make people angry and damage the cooperative working relationship that we are trying to build.

but I think we can solve that just like other information that gets passed from user into the IB/RDMA kernel modules.

Sharing the IP 4-tuple space cooperatively with the core in any fashion has been NAK'd. Without this cooperation, the options we've been able to come up with are administrative/policy based approaches. Any ideas you have along these lines are welcome.

Tom

Glenn.

 - R.
[ofa-general] [RFC,PATCH 00/20] svc: Server Side Transport Switch
This patchset modifies the RPC server-side implementation to support pluggable transports. This was done in order to allow RPC applications (NFS) to run over RDMA transports like IB and iWARP. This patchset was also published to [EMAIL PROTECTED]

This patchset represents an update to the previously published version. The most significant changes are a renaming of the transport switch data structures and functions based on a recommendation from Chuck Lever. Code cleanup was also done in the portlist implementation based on feedback from Trond. I've included the original description below for new reviewers.

This patchset implements a sunrpc server-side pluggable transport switch that supports dynamically registered transports. The knfsd daemon has been modified to allow user-mode programs to add a new listening endpoint by writing a string to the portlist file. The format of the string is as follows:

    <transport-name> <port>

For example,

    # echo rdma 2050 > /proc/fs/nfsd/portlist

will cause the knfsd daemon to attempt to add a listening endpoint on port 2050 using the 'rdma' transport.

Transports register themselves with the transport switch using a new API that has the following synopsis:

    void svc_register_transport(struct svc_sock_ops *xprt);

The text transport name is contained in a field in the xprt structure. A new service has been added as well to take a transport name instead of an IP protocol number to specify the transport on which the listening endpoint is to be created. This function is defined as follows:

    int svc_create_svcsock(struct svc_serv *serv, char *transport_name, unsigned short port, int flags);

The existing svc_makesock interface was left to avoid impacts to existing servers. It has been modified to map IP protocol numbers to transport strings.
--
Signed-off-by: Tom Tucker [EMAIL PROTECTED]
[ofa-general] [RFC, PATCH 01/20] svc: Add svc_xprt transport switch structure
Start moving to a transport switch for knfsd. Add a svc_xprt switch and move the sk_sendto and sk_recvfrom function pointers into it.

Signed-off-by: Greg Banks [EMAIL PROTECTED]
Signed-off-by: Peter Leckie [EMAIL PROTECTED]
Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---
 include/linux/sunrpc/svcsock.h |    9 +++++++--
 net/sunrpc/svcsock.c           |   22 ++++++++++++++------
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index e21dd93..4792ed6 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -11,6 +11,12 @@ #define SUNRPC_SVCSOCK_H
 #include <linux/sunrpc/svc.h>
 
+struct svc_xprt {
+	const char		*xpt_name;
+	int			(*xpt_recvfrom)(struct svc_rqst *rqstp);
+	int			(*xpt_sendto)(struct svc_rqst *rqstp);
+};
+
 /*
  * RPC server socket.
  */
@@ -43,8 +49,7 @@ #define SK_DETACHED	10	/* detached fro
 					 * be revisted */
 	struct mutex		sk_mutex;	/* to serialize sending data */
 
-	int			(*sk_recvfrom)(struct svc_rqst *rqstp);
-	int			(*sk_sendto)(struct svc_rqst *rqstp);
+	const struct svc_xprt	*sk_xprt;
 
 	/* We keep the old state_change and data_ready CB's here */
 	void			(*sk_ostate)(struct sock *);
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 5baf48d..789d94a 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -885,6 +885,12 @@ svc_udp_sendto(struct svc_rqst *rqstp)
 	return error;
 }
 
+static const struct svc_xprt svc_udp_xprt = {
+	.xpt_name = "udp",
+	.xpt_recvfrom = svc_udp_recvfrom,
+	.xpt_sendto = svc_udp_sendto,
+};
+
 static void
 svc_udp_init(struct svc_sock *svsk)
 {
@@ -893,8 +899,7 @@ svc_udp_init(struct svc_sock *svsk)
 	svsk->sk_sk->sk_data_ready = svc_udp_data_ready;
 	svsk->sk_sk->sk_write_space = svc_write_space;
-	svsk->sk_recvfrom = svc_udp_recvfrom;
-	svsk->sk_sendto = svc_udp_sendto;
+	svsk->sk_xprt = &svc_udp_xprt;
 
 	/* initialise setting must have enough space to
 	 * receive and respond to one request.
@@ -1322,14 +1327,19 @@ svc_tcp_sendto(struct svc_rqst *rqstp)
 	return sent;
 }
 
+static const struct svc_xprt svc_tcp_xprt = {
+	.xpt_name = "tcp",
+	.xpt_recvfrom = svc_tcp_recvfrom,
+	.xpt_sendto = svc_tcp_sendto,
+};
+
 static void
 svc_tcp_init(struct svc_sock *svsk)
 {
 	struct sock	*sk = svsk->sk_sk;
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	svsk->sk_recvfrom = svc_tcp_recvfrom;
-	svsk->sk_sendto = svc_tcp_sendto;
+	svsk->sk_xprt = &svc_tcp_xprt;
 
 	if (sk->sk_state == TCP_LISTEN) {
 		dprintk("setting up TCP socket for listening\n");
@@ -1477,7 +1487,7 @@ svc_recv(struct svc_rqst *rqstp, long ti
 	dprintk("svc: server %p, pool %u, socket %p, inuse=%d\n",
 		rqstp, pool->sp_id, svsk, atomic_read(&svsk->sk_inuse));
-	len = svsk->sk_recvfrom(rqstp);
+	len = svsk->sk_xprt->xpt_recvfrom(rqstp);
 	dprintk("svc: got len=%d\n", len);
 
 	/* No data, incomplete (TCP) read, or accept() */
@@ -1537,7 +1547,7 @@ svc_send(struct svc_rqst *rqstp)
 	if (test_bit(SK_DEAD, &svsk->sk_flags))
 		len = -ENOTCONN;
 	else
-		len = svsk->sk_sendto(rqstp);
+		len = svsk->sk_xprt->xpt_sendto(rqstp);
 	mutex_unlock(&svsk->sk_mutex);
 	svc_sock_release(rqstp);
[ofa-general] [RFC,PATCH 02/20] svc: xpt_detach and xpt_free
Add transport switch functions to ensure that no additional receive ready events will be delivered by the transport (xpt_detach), and another to free memory associated with the transport (xpt_free). Change svc_delete_socket() and svc_sock_put() to use the new transport functions.

Signed-off-by: Greg Banks [EMAIL PROTECTED]
Signed-off-by: Peter Leckie [EMAIL PROTECTED]
Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---
 include/linux/sunrpc/svcsock.h |   12 ++++++++++
 net/sunrpc/svcsock.c           |   50 +++++++++++++++++++++++++++++++++------
 2 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index 4792ed6..27c5b1f 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -15,6 +15,18 @@ struct svc_xprt {
 	const char		*xpt_name;
 	int			(*xpt_recvfrom)(struct svc_rqst *rqstp);
 	int			(*xpt_sendto)(struct svc_rqst *rqstp);
+	/*
+	 * Detach the svc_sock from its socket, so that the
+	 * svc_sock will not be enqueued any more. This is
+	 * the first stage in the destruction of a svc_sock.
+	 */
+	void			(*xpt_detach)(struct svc_sock *);
+	/*
+	 * Release all network-level resources held by the svc_sock,
+	 * and the svc_sock itself. This is the final stage in the
+	 * destruction of a svc_sock.
+	 */
+	void			(*xpt_free)(struct svc_sock *);
 };
 
 /*
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 789d94a..4956c88 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -84,6 +84,8 @@ static void svc_udp_data_ready(struct s
 static int		svc_udp_recvfrom(struct svc_rqst *);
 static int		svc_udp_sendto(struct svc_rqst *);
 static void		svc_close_socket(struct svc_sock *svsk);
+static void		svc_sock_detach(struct svc_sock *);
+static void		svc_sock_free(struct svc_sock *);
 
 static struct svc_deferred_req *svc_deferred_dequeue(struct svc_sock *svsk);
 static int svc_deferred_recv(struct svc_rqst *rqstp);
@@ -378,14 +380,9 @@ svc_sock_put(struct svc_sock *svsk)
 	if (atomic_dec_and_test(&svsk->sk_inuse)) {
 		BUG_ON(! test_bit(SK_DEAD, &svsk->sk_flags));
-		dprintk("svc: releasing dead socket\n");
-		if (svsk->sk_sock->file)
-			sockfd_put(svsk->sk_sock);
-		else
-			sock_release(svsk->sk_sock);
 		if (svsk->sk_info_authunix != NULL)
 			svcauth_unix_info_release(svsk->sk_info_authunix);
-		kfree(svsk);
+		svsk->sk_xprt->xpt_free(svsk);
 	}
 }
 
@@ -889,6 +886,8 @@ static const struct svc_xprt svc_udp_xpr
 	.xpt_name = "udp",
 	.xpt_recvfrom = svc_udp_recvfrom,
 	.xpt_sendto = svc_udp_sendto,
+	.xpt_detach = svc_sock_detach,
+	.xpt_free = svc_sock_free,
 };
 
 static void
@@ -1331,6 +1330,8 @@ static const struct svc_xprt svc_tcp_xpr
 	.xpt_name = "tcp",
 	.xpt_recvfrom = svc_tcp_recvfrom,
 	.xpt_sendto = svc_tcp_sendto,
+	.xpt_detach = svc_sock_detach,
+	.xpt_free = svc_sock_free,
 };
 
 static void
@@ -1770,6 +1771,38 @@ bummer:
 }
 
 /*
+ * Detach the svc_sock from the socket so that no
+ * more callbacks occur.
+ */
+static void
+svc_sock_detach(struct svc_sock *svsk)
+{
+	struct sock *sk = svsk->sk_sk;
+
+	dprintk("svc: svc_sock_detach(%p)\n", svsk);
+
+	/* put back the old socket callbacks */
+	sk->sk_state_change = svsk->sk_ostate;
+	sk->sk_data_ready = svsk->sk_odata;
+	sk->sk_write_space = svsk->sk_owspace;
+}
+
+/*
+ * Free the svc_sock's socket resources and the svc_sock itself.
+ */
+static void
+svc_sock_free(struct svc_sock *svsk)
+{
+	dprintk("svc: svc_sock_free(%p)\n", svsk);
+
+	if (svsk->sk_sock->file)
+		sockfd_put(svsk->sk_sock);
+	else
+		sock_release(svsk->sk_sock);
+	kfree(svsk);
+}
+
+/*
  * Remove a dead socket
  */
 static void
@@ -1783,9 +1816,8 @@ svc_delete_socket(struct svc_sock *svsk)
 	serv = svsk->sk_server;
 	sk = svsk->sk_sk;
 
-	sk->sk_state_change = svsk->sk_ostate;
-	sk->sk_data_ready = svsk->sk_odata;
-	sk->sk_write_space = svsk->sk_owspace;
+	if (svsk->sk_xprt->xpt_detach)
+		svsk->sk_xprt->xpt_detach(svsk);
 
 	spin_lock_bh(&serv->sv_lock);
[ofa-general] [RFC,PATCH 03/20] svc: xpt_prep_reply_hdr
Add a transport function that prepares the transport-specific header for RPC replies. UDP has none; TCP has a 4B record length. This will allow the RDMA transport to prepare its variable-length reply header as well.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---
 include/linux/sunrpc/svcsock.h |    4 ++++
 net/sunrpc/svc.c               |    8 +++++---
 net/sunrpc/svcsock.c           |   15 +++++++++++++++
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index 27c5b1f..1da42c2 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -27,6 +27,10 @@ struct svc_xprt {
 	 * destruction of a svc_sock.
 	 */
 	void			(*xpt_free)(struct svc_sock *);
+	/*
+	 * Prepare any transport-specific RPC header.
+	 */
+	int			(*xpt_prep_reply_hdr)(struct svc_rqst *);
 };
 
 /*
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index e673ef9..72a900f 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -815,9 +815,11 @@ svc_process(struct svc_rqst *rqstp)
 	rqstp->rq_res.tail[0].iov_len = 0;
 	/* Will be turned off only in gss privacy case: */
 	rqstp->rq_sendfile_ok = 1;
-	/* tcp needs a space for the record length... */
-	if (rqstp->rq_prot == IPPROTO_TCP)
-		svc_putnl(resv, 0);
+
+	/* setup response header. */
+	if (rqstp->rq_sock->sk_xprt->xpt_prep_reply_hdr &&
+	    rqstp->rq_sock->sk_xprt->xpt_prep_reply_hdr(rqstp))
+		goto dropit;
 
 	rqstp->rq_xid = svc_getu32(argv);
 	svc_putu32(resv, rqstp->rq_xid);
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 4956c88..ca473ee 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1326,12 +1326,27 @@ svc_tcp_sendto(struct svc_rqst *rqstp)
 	return sent;
 }
 
+/*
+ * Setup response header. TCP has a 4B record length field.
+ */
+static int
+svc_tcp_prep_reply_hdr(struct svc_rqst *rqstp)
+{
+	struct kvec *resv = &rqstp->rq_res.head[0];
+
+	/* tcp needs a space for the record length... */
+	svc_putnl(resv, 0);
+
+	return 0;
+}
+
 static const struct svc_xprt svc_tcp_xprt = {
 	.xpt_name = "tcp",
 	.xpt_recvfrom = svc_tcp_recvfrom,
 	.xpt_sendto = svc_tcp_sendto,
 	.xpt_detach = svc_sock_detach,
 	.xpt_free = svc_sock_free,
+	.xpt_prep_reply_hdr = svc_tcp_prep_reply_hdr,
 };
 
 static void
[ofa-general] [RFC,PATCH 04/20] svc: xpt_has_wspace
Move the code that checks for available write space on the socket into a new transport function. This will allow transports flexibility when determining if enough space/memory is available to process the reply. The role of this function for RDMA is to avoid stalling a knfsd thread when SQ space is not available.

Signed-off-by: Greg Banks [EMAIL PROTECTED]
Signed-off-by: Peter Leckie [EMAIL PROTECTED]
Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---
 include/linux/sunrpc/svcsock.h |    4 ++
 net/sunrpc/svcsock.c           |   75 +++++++++++++++++++++++++--------------
 2 files changed, 52 insertions(+), 27 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index 1da42c2..3faa95c 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -31,6 +31,10 @@ struct svc_xprt {
 	 * Prepare any transport-specific RPC header.
 	 */
 	int			(*xpt_prep_reply_hdr)(struct svc_rqst *);
+	/*
+	 * Return 1 if sufficient space to write reply to network.
+	 */
+	int			(*xpt_has_wspace)(struct svc_sock *);
 };
 
 /*
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index ca473ee..b16dad4 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -205,22 +205,6 @@ svc_release_skb(struct svc_rqst *rqstp)
 }
 
 /*
- * Any space to write?
- */
-static inline unsigned long
-svc_sock_wspace(struct svc_sock *svsk)
-{
-	int wspace;
-
-	if (svsk->sk_sock->type == SOCK_STREAM)
-		wspace = sk_stream_wspace(svsk->sk_sk);
-	else
-		wspace = sock_wspace(svsk->sk_sk);
-
-	return wspace;
-}
-
-/*
  * Queue up a socket with data pending. If there are idle nfsd
  * processes, wake 'em up.
  *
@@ -269,21 +253,13 @@ svc_sock_enqueue(struct svc_sock *svsk)
 	BUG_ON(svsk->sk_pool != NULL);
 	svsk->sk_pool = pool;
 
-	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
-	if (((atomic_read(&svsk->sk_reserved) + serv->sv_max_mesg)*2
-	     > svc_sock_wspace(svsk))
-	    && !test_bit(SK_CLOSE, &svsk->sk_flags)
-	    && !test_bit(SK_CONN, &svsk->sk_flags)) {
-		/* Don't enqueue while not enough space for reply */
-		dprintk("svc: socket %p no space, %d*2 > %ld, not enqueued\n",
-			svsk->sk_sk, atomic_read(&svsk->sk_reserved)+serv->sv_max_mesg,
-			svc_sock_wspace(svsk));
+	if (!test_bit(SK_CLOSE, &svsk->sk_flags)
+	    && !test_bit(SK_CONN, &svsk->sk_flags)
+	    && !svsk->sk_xprt->xpt_has_wspace(svsk)) {
 		svsk->sk_pool = NULL;
 		clear_bit(SK_BUSY, &svsk->sk_flags);
 		goto out_unlock;
 	}
-	clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
 
 	if (!list_empty(&pool->sp_threads)) {
 		rqstp = list_entry(pool->sp_threads.next,
@@ -882,12 +858,45 @@ svc_udp_sendto(struct svc_rqst *rqstp)
 	return error;
 }
 
+/**
+ * svc_sock_has_write_space - Checks if there is enough space
+ * to send the reply on the socket.
+ * @svsk: the svc_sock to write on
+ * @wspace: the number of bytes available for writing
+ */
+static int svc_sock_has_write_space(struct svc_sock *svsk, int wspace)
+{
+	struct svc_serv *serv = svsk->sk_server;
+	int required = atomic_read(&svsk->sk_reserved) + serv->sv_max_mesg;
+
+	if (required*2 > wspace) {
+		/* Don't enqueue while not enough space for reply */
+		dprintk("svc: socket %p no space, %d*2 > %d, not enqueued\n",
+			svsk->sk_sk, required, wspace);
+		return 0;
+	}
+	clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
+	return 1;
+}
+
+static int
+svc_udp_has_wspace(struct svc_sock *svsk)
+{
+	/*
+	 * Set the SOCK_NOSPACE flag before checking the available
+	 * sock space.
+	 */
+	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
+	return svc_sock_has_write_space(svsk, sock_wspace(svsk->sk_sk));
+}
+
 static const struct svc_xprt svc_udp_xprt = {
 	.xpt_name = "udp",
 	.xpt_recvfrom = svc_udp_recvfrom,
 	.xpt_sendto = svc_udp_sendto,
 	.xpt_detach = svc_sock_detach,
 	.xpt_free = svc_sock_free,
+	.xpt_has_wspace = svc_udp_has_wspace,
 };
 
 static void
@@ -1340,6 +1349,17 @@ svc_tcp_prep_reply_hdr(struct svc_rqst *
 	return 0;
 }
 
+static int
+svc_tcp_has_wspace(struct svc_sock *svsk)
+{
+	/*
+	 * Set the SOCK_NOSPACE flag before checking the available
+	 * sock space.
+	 */
+	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
+	return svc_sock_has_write_space(svsk, sk_stream_wspace(svsk->sk_sk));
+}
+
 static const struct svc_xprt svc_tcp_xprt = {
 	.xpt_name = "tcp",
 	.xpt_recvfrom = svc_tcp_recvfrom,
 	.xpt_sendto = svc_tcp_sendto,
@@ -1347,6 +1367,7 @@ static const struct svc_xprt svc_tcp_xpr
 	.xpt_detach = svc_sock_detach,
 	.xpt_free = svc_sock_free,
 	.xpt_prep_reply_hdr = svc_tcp_prep_reply_hdr,
[ofa-general] [RFC, PATCH 06/20] svc: export svc_sock_enqueue, svc_sock_received
Export svc_sock_enqueue() and svc_sock_received() so they can be used by sunrpc server transport implementations (even future modular ones).

Signed-off-by: Greg Banks [EMAIL PROTECTED]
Signed-off-by: Peter Leckie [EMAIL PROTECTED]
Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---
 include/linux/sunrpc/svcsock.h |    2 ++
 net/sunrpc/svcsock.c           |    7 ++++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index 4e24e6d..0145057 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -108,6 +108,8 @@ int		svc_addsock(struct svc_serv *serv,
 					    int fd,
 					    char *name_return,
 					    int *proto);
+void		svc_sock_enqueue(struct svc_sock *svsk);
+void		svc_sock_received(struct svc_sock *svsk);
 
 /*
  * svc_makesock socket characteristics
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 0dc94a8..8fad53d 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -209,7 +209,7 @@ svc_release_skb(struct svc_rqst *rqstp)
  * processes, wake 'em up.
  *
  */
-static void
+void
 svc_sock_enqueue(struct svc_sock *svsk)
 {
 	struct svc_serv	*serv = svsk->sk_server;
@@ -287,6 +287,7 @@ svc_sock_enqueue(struct svc_sock *svsk)
 out_unlock:
 	spin_unlock_bh(&pool->sp_lock);
 }
+EXPORT_SYMBOL_GPL(svc_sock_enqueue);
 
 /*
  * Dequeue the first socket. Must be called with the pool->sp_lock held.
@@ -315,14 +316,14 @@ svc_sock_dequeue(struct svc_pool *pool)
  * Note: SK_DATA only gets cleared when a read-attempt finds
  * no (or insufficient) data.
  */
-static inline void
+void
 svc_sock_received(struct svc_sock *svsk)
 {
 	svsk->sk_pool = NULL;
 	clear_bit(SK_BUSY, &svsk->sk_flags);
 	svc_sock_enqueue(svsk);
 }
-
+EXPORT_SYMBOL_GPL(svc_sock_received);
 
 /**
  * svc_reserve - change the space reserved for the reply to a request.
[ofa-general] [RFC,PATCH 10/20] svc: Add generic refcount services
Add inline svc_sock_get() so that service transport code will not need to manipulate sk_inuse directly. Also, make svc_sock_put() available so that transport code outside svcsock.c can use it.

Signed-off-by: Greg Banks [EMAIL PROTECTED]
Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---
 include/linux/sunrpc/svcsock.h |   15 +++++++++++++++
 net/sunrpc/svcsock.c           |   29 ++++++++++++++---------------
 2 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index ea8b62b..9f37f30 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -115,6 +115,7 @@ int		svc_addsock(struct svc_serv *serv,
 					    int *proto);
 void		svc_sock_enqueue(struct svc_sock *svsk);
 void		svc_sock_received(struct svc_sock *svsk);
+void		__svc_sock_put(struct svc_sock *svsk);
 
 /*
  * svc_makesock socket characteristics
@@ -123,4 +124,18 @@ #define SVC_SOCK_DEFAULTS	(0U)
 #define SVC_SOCK_ANONYMOUS	(1U << 0)	/* don't register with pmap */
 #define SVC_SOCK_TEMPORARY	(1U << 1)	/* flag socket as temporary */
 
+/*
+ * Take and drop a temporary reference count on the svc_sock.
+ */
+static inline void svc_sock_get(struct svc_sock *svsk)
+{
+	atomic_inc(&svsk->sk_inuse);
+}
+
+static inline void svc_sock_put(struct svc_sock *svsk)
+{
+	if (atomic_dec_and_test(&svsk->sk_inuse))
+		__svc_sock_put(svsk);
+}
+
 #endif /* SUNRPC_SVCSOCK_H */
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index dcb5c7a..02f682a 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -273,7 +273,7 @@ svc_sock_enqueue(struct svc_sock *svsk)
 			"svc_sock_enqueue: server %p, rq_sock=%p!\n",
 			rqstp, rqstp->rq_sock);
 		rqstp->rq_sock = svsk;
-		atomic_inc(&svsk->sk_inuse);
+		svc_sock_get(svsk);
 		rqstp->rq_reserved = serv->sv_max_mesg;
 		atomic_add(rqstp->rq_reserved, &svsk->sk_reserved);
 		BUG_ON(svsk->sk_pool != pool);
@@ -351,17 +351,16 @@ void svc_reserve(struct svc_rqst *rqstp,
 /*
  * Release a socket after use.
  */
-static inline void
-svc_sock_put(struct svc_sock *svsk)
+void
+__svc_sock_put(struct svc_sock *svsk)
 {
-	if (atomic_dec_and_test(&svsk->sk_inuse)) {
-		BUG_ON(! test_bit(SK_DEAD, &svsk->sk_flags));
+	BUG_ON(! test_bit(SK_DEAD, &svsk->sk_flags));
 
-		if (svsk->sk_info_authunix != NULL)
-			svcauth_unix_info_release(svsk->sk_info_authunix);
-		svsk->sk_xprt->xpt_free(svsk);
-	}
+	if (svsk->sk_info_authunix != NULL)
+		svcauth_unix_info_release(svsk->sk_info_authunix);
+	svsk->sk_xprt->xpt_free(svsk);
 }
+EXPORT_SYMBOL_GPL(__svc_sock_put);
 
 static void
 svc_sock_release(struct svc_rqst *rqstp)
@@ -1109,7 +1108,7 @@ svc_tcp_accept(struct svc_sock *svsk)
 					  struct svc_sock,
 					  sk_list);
 			set_bit(SK_CLOSE, &svsk->sk_flags);
-			atomic_inc(&svsk->sk_inuse);
+			svc_sock_get(svsk);
 		}
 		spin_unlock_bh(&serv->sv_lock);
@@ -1481,7 +1480,7 @@ svc_recv(struct svc_rqst *rqstp, long ti
 	spin_lock_bh(&pool->sp_lock);
 	if ((svsk = svc_sock_dequeue(pool)) != NULL) {
 		rqstp->rq_sock = svsk;
-		atomic_inc(&svsk->sk_inuse);
+		svc_sock_get(svsk);
 		rqstp->rq_reserved = serv->sv_max_mesg;
 		atomic_add(rqstp->rq_reserved, &svsk->sk_reserved);
 	} else {
@@ -1620,7 +1619,7 @@ svc_age_temp_sockets(unsigned long closu
 			continue;
 		if (atomic_read(&svsk->sk_inuse) || test_bit(SK_BUSY, &svsk->sk_flags))
 			continue;
-		atomic_inc(&svsk->sk_inuse);
+		svc_sock_get(svsk);
 		list_move(le, &to_be_aged);
 		set_bit(SK_CLOSE, &svsk->sk_flags);
 		set_bit(SK_DETACHED, &svsk->sk_flags);
@@ -1868,7 +1867,7 @@ svc_delete_socket(struct svc_sock *svsk)
 	 */
 	if (!test_and_set_bit(SK_DEAD, &svsk->sk_flags)) {
 		BUG_ON(atomic_read(&svsk->sk_inuse)<2);
-		atomic_dec(&svsk->sk_inuse);
+		svc_sock_put(svsk);
 		if (test_bit(SK_TEMP, &svsk->sk_flags))
 			serv->sv_tmpcnt--;
 	}
@@ -1883,7 +1882,7 @@ static void svc_close_socket(struct svc_
 		/* someone else will have to effect the close */
 		return;
 
-	atomic_inc(&svsk->sk_inuse);
+	svc_sock_get(svsk);
 	svc_delete_socket(svsk);
 	clear_bit(SK_BUSY, &svsk->sk_flags);
 	svc_sock_put(svsk);
@@ -1976,7 +1975,7 @@ svc_defer(struct cache_req *req)
 		dr->argslen = rqstp->rq_arg.len >> 2;
 		memcpy(dr
[ofa-general] [RFC,PATCH 13/20] svc: Add svc_[un]register_transport
Add an exported function for transport modules to [un]register themselves with the sunrpc server side transport switch. Signed-off-by: Tom Tucker [EMAIL PROTECTED] --- include/linux/sunrpc/svcsock.h |6 + net/sunrpc/svcsock.c | 50 2 files changed, 56 insertions(+), 0 deletions(-) diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h index 7def951..cc911ab 100644 --- a/include/linux/sunrpc/svcsock.h +++ b/include/linux/sunrpc/svcsock.h @@ -13,6 +13,7 @@ #include linux/sunrpc/svc.h struct svc_xprt { const char *xpt_name; + struct module *xpt_owner; int (*xpt_recvfrom)(struct svc_rqst *rqstp); int (*xpt_sendto)(struct svc_rqst *rqstp); /* @@ -45,7 +46,10 @@ struct svc_xprt { * Accept a pending connection, for connection-oriented transports */ int (*xpt_accept)(struct svc_sock *svsk); + /* Transport list link */ + struct list_headxpt_list; }; +extern struct list_head svc_transport_list; /* * RPC server socket. @@ -102,6 +106,8 @@ #define SK_LISTENER 11 /* listener (e. /* * Function prototypes. 
*/ +intsvc_register_transport(struct svc_xprt *xprt); +intsvc_unregister_transport(struct svc_xprt *xprt); intsvc_makesock(struct svc_serv *, int, unsigned short, int flags); void svc_force_close_socket(struct svc_sock *); intsvc_recv(struct svc_rqst *, long); diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c index 6acf22f..6183951 100644 --- a/net/sunrpc/svcsock.c +++ b/net/sunrpc/svcsock.c @@ -91,6 +91,54 @@ static struct svc_deferred_req *svc_defe static int svc_deferred_recv(struct svc_rqst *rqstp); static struct cache_deferred_req *svc_defer(struct cache_req *req); +/* List of registered transports */ +static spinlock_t svc_transport_lock = SPIN_LOCK_UNLOCKED; +LIST_HEAD(svc_transport_list); + +int svc_register_transport(struct svc_xprt *xprt) +{ + struct svc_xprt *ops; + int res; + + dprintk(svc: Adding svc transport '%s'\n, + xprt-xpt_name); + + res = -EEXIST; + INIT_LIST_HEAD(xprt-xpt_list); + spin_lock(svc_transport_lock); + list_for_each_entry(ops, svc_transport_list, xpt_list) { + if (xprt == ops) + goto out; + } + list_add_tail(xprt-xpt_list, svc_transport_list); + res = 0; +out: + spin_unlock(svc_transport_lock); + return res; +} +EXPORT_SYMBOL_GPL(svc_register_transport); + +int svc_unregister_transport(struct svc_xprt *xprt) +{ + struct svc_xprt *ops; + int res = 0; + + dprintk(svc: Removing svc transport '%s'\n, xprt-xpt_name); + + spin_lock(svc_transport_lock); + list_for_each_entry(ops, svc_transport_list, xpt_list) { + if (xprt == ops) { + list_del_init(ops-xpt_list); + goto out; + } + } + res = -ENOENT; + out: + spin_unlock(svc_transport_lock); + return res; +} +EXPORT_SYMBOL_GPL(svc_unregister_transport); + /* apparently the standard is that clients close * idle connections after 5 minutes, servers after * 6 minutes @@ -887,6 +935,7 @@ svc_udp_has_wspace(struct svc_sock *svsk static const struct svc_xprt svc_udp_xprt = { .xpt_name = udp, + .xpt_owner = THIS_MODULE, .xpt_recvfrom = svc_udp_recvfrom, .xpt_sendto = svc_udp_sendto, 
.xpt_detach = svc_sock_detach, @@ -1346,6 +1395,7 @@ svc_tcp_has_wspace(struct svc_sock *svsk static const struct svc_xprt svc_tcp_xprt = { .xpt_name = tcp, + .xpt_owner = THIS_MODULE, .xpt_recvfrom = svc_tcp_recvfrom, .xpt_sendto = svc_tcp_sendto, .xpt_detach = svc_sock_detach, ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [RFC,PATCH 14/20] svc: Register TCP/UDP Transports
Add a call to svc_register_transport for the built in transports UDP and TCP. The registration is done in the sunrpc module initialization logic. Signed-off-by: Tom Tucker [EMAIL PROTECTED] --- net/sunrpc/sunrpc_syms.c |2 ++ net/sunrpc/svcsock.c | 10 -- 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/net/sunrpc/sunrpc_syms.c b/net/sunrpc/sunrpc_syms.c index 73075de..c68577b 100644 --- a/net/sunrpc/sunrpc_syms.c +++ b/net/sunrpc/sunrpc_syms.c @@ -134,6 +134,7 @@ EXPORT_SYMBOL(nfsd_debug); EXPORT_SYMBOL(nlm_debug); #endif +extern void init_svc_xprt(void); extern struct cache_detail ip_map_cache, unix_gid_cache; static int __init @@ -156,6 +157,7 @@ #endif cache_register(ip_map_cache); cache_register(unix_gid_cache); init_socket_xprt(); + init_svc_xprt(); out: return err; } diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c index 6183951..d6443e8 100644 --- a/net/sunrpc/svcsock.c +++ b/net/sunrpc/svcsock.c @@ -933,7 +933,7 @@ svc_udp_has_wspace(struct svc_sock *svsk return svc_sock_has_write_space(svsk, sock_wspace(svsk-sk_sk)); } -static const struct svc_xprt svc_udp_xprt = { +static struct svc_xprt svc_udp_xprt = { .xpt_name = udp, .xpt_owner = THIS_MODULE, .xpt_recvfrom = svc_udp_recvfrom, @@ -1393,7 +1393,7 @@ svc_tcp_has_wspace(struct svc_sock *svsk return svc_sock_has_write_space(svsk, sk_stream_wspace(svsk-sk_sk)); } -static const struct svc_xprt svc_tcp_xprt = { +static struct svc_xprt svc_tcp_xprt = { .xpt_name = tcp, .xpt_owner = THIS_MODULE, .xpt_recvfrom = svc_tcp_recvfrom, @@ -1406,6 +1406,12 @@ static const struct svc_xprt svc_tcp_xpr .xpt_accept = svc_tcp_accept, }; +void init_svc_xprt(void) +{ + svc_register_transport(svc_udp_xprt); + svc_register_transport(svc_tcp_xprt); +} + static void svc_tcp_init_listener(struct svc_sock *svsk) {
[ofa-general] [RFC,PATCH 15/20] svc: transport file implementation
Create a proc/sys/sunrpc/transport file that contains information about the currently registered transports. Signed-off-by: Tom Tucker [EMAIL PROTECTED] --- include/linux/sunrpc/debug.h |1 + net/sunrpc/svcsock.c | 28 net/sunrpc/sysctl.c | 40 +++- 3 files changed, 68 insertions(+), 1 deletions(-) diff --git a/include/linux/sunrpc/debug.h b/include/linux/sunrpc/debug.h index 10709cb..89458df 100644 --- a/include/linux/sunrpc/debug.h +++ b/include/linux/sunrpc/debug.h @@ -88,6 +88,7 @@ enum { CTL_SLOTTABLE_TCP, CTL_MIN_RESVPORT, CTL_MAX_RESVPORT, + CTL_TRANSPORTS, }; #endif /* _LINUX_SUNRPC_DEBUG_H_ */ diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c index d6443e8..276737e 100644 --- a/net/sunrpc/svcsock.c +++ b/net/sunrpc/svcsock.c @@ -139,6 +139,34 @@ int svc_unregister_transport(struct svc_ } EXPORT_SYMBOL_GPL(svc_unregister_transport); +/* + * Format the transport list for printing + */ +int svc_print_transports(char *buf, int maxlen) +{ + struct list_head *le; + char tmpstr[80]; + int len = 0; + buf[0] = '\0'; + + spin_lock(svc_transport_lock); + list_for_each(le, svc_transport_list) { + int slen; + struct svc_xprt *xprt = + list_entry(le, struct svc_xprt, xpt_list); + + sprintf(tmpstr, %s %d\n, xprt-xpt_name, xprt-xpt_max_payload); + slen = strlen(tmpstr); + if (len + slen maxlen) + break; + len += slen; + strcat(buf, tmpstr); + } + spin_unlock(svc_transport_lock); + + return len; +} + /* apparently the standard is that clients close * idle connections after 5 minutes, servers after * 6 minutes diff --git a/net/sunrpc/sysctl.c b/net/sunrpc/sysctl.c index 738db32..683cf90 100644 --- a/net/sunrpc/sysctl.c +++ b/net/sunrpc/sysctl.c @@ -27,6 +27,9 @@ unsigned int nfs_debug; unsigned int nfsd_debug; unsigned int nlm_debug; +/* Transport string */ +char xprt_buf[128]; + #ifdef RPC_DEBUG static struct ctl_table_header *sunrpc_table_header; @@ -48,6 +51,34 @@ rpc_unregister_sysctl(void) } } +int svc_print_transports(char *buf, int maxlen); +static int 
proc_do_xprt(ctl_table *table, int write, struct file *file, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + char tmpbuf[128]; + int len; + if ((*ppos !write) || !*lenp) { + *lenp = 0; + return 0; + } + + if (write) + return -EINVAL; + else { + + len = svc_print_transports(tmpbuf, 128); + if (!access_ok(VERIFY_WRITE, buffer, len)) + return -EFAULT; + + if (__copy_to_user(buffer, tmpbuf, len)) + return -EFAULT; + } + + *lenp -= len; + *ppos += len; + return 0; +} + static int proc_dodebug(ctl_table *table, int write, struct file *file, void __user *buffer, size_t *lenp, loff_t *ppos) @@ -111,7 +142,6 @@ done: return 0; } - static ctl_table debug_table[] = { { .ctl_name = CTL_RPCDEBUG, @@ -145,6 +175,14 @@ static ctl_table debug_table[] = { .mode = 0644, .proc_handler = proc_dodebug }, + { + .ctl_name = CTL_TRANSPORTS, + .procname = transports, + .data = xprt_buf, + .maxlen = sizeof(xprt_buf), + .mode = 0444, + .proc_handler = proc_do_xprt, + }, { .ctl_name = 0 } };
[ofa-general] [RFC, PATCH 20/20] knfsd: create listener via portlist write
Update the write handler for the portlist file to allow creating new listening endpoints on a transport. The general form of the string is a transport name and a port number separated by a space. For example: tcp 2049 This is intended to support the creation of a listening endpoint for RDMA transports without adding #ifdef code to the nfssvc.c file. The built-in transports UDP/TCP were left in the nfssvc initialization code to avoid having to change rpc.nfsd, etc... Signed-off-by: Tom Tucker [EMAIL PROTECTED] --- fs/nfsd/nfsctl.c | 17 + 1 files changed, 17 insertions(+), 0 deletions(-) diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c index 71c686d..da2abda 100644 --- a/fs/nfsd/nfsctl.c +++ b/fs/nfsd/nfsctl.c @@ -555,6 +555,23 @@ static ssize_t write_ports(struct file * kfree(toclose); return len; } + /* This implements the ability to add a transport by writing +* its transport name to the portlist file +*/ + if (isalnum(buf[0])) { + int err; + char transport[16]; + int port; + if (sscanf(buf, %15s %4d, transport, port) == 2) { + err = nfsd_create_serv(); + if (!err) + err = svc_create_svcsock(nfsd_serv, +transport, port, +SVC_SOCK_ANONYMOUS); + return err 0 ? err : 0; + } + } + return -EINVAL; }
[ofa-general] [RFC,PATCH 01/10] rdma: ONCRPC RDMA Header File
These are the core data types that are used to process the ONCRPC protocol in the NFS-RDMA client and server. Signed-off-by: Tom Talpey [EMAIL PROTECTED] --- include/linux/sunrpc/rpc_rdma.h | 116 +++ 1 files changed, 116 insertions(+), 0 deletions(-) diff --git a/include/linux/sunrpc/rpc_rdma.h b/include/linux/sunrpc/rpc_rdma.h new file mode 100644 index 000..0013a0d --- /dev/null +++ b/include/linux/sunrpc/rpc_rdma.h @@ -0,0 +1,116 @@ +/* + * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the BSD-type + * license below: + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials provided + * with the distribution. + * + * Neither the name of the Network Appliance, Inc. nor the names of + * its contributors may be used to endorse or promote products + * derived from this software without specific prior written + * permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#ifndef _LINUX_SUNRPC_RPC_RDMA_H +#define _LINUX_SUNRPC_RPC_RDMA_H + +struct rpcrdma_segment { + uint32_t rs_handle; /* Registered memory handle */ + uint32_t rs_length; /* Length of the chunk in bytes */ + uint64_t rs_offset; /* Chunk virtual address or offset */ +}; + +/* + * read chunk(s), encoded as a linked list. + */ +struct rpcrdma_read_chunk { + uint32_t rc_discrim;/* 1 indicates presence */ + uint32_t rc_position; /* Position in XDR stream */ + struct rpcrdma_segment rc_target; +}; + +/* + * write chunk, and reply chunk. + */ +struct rpcrdma_write_chunk { + struct rpcrdma_segment wc_target; +}; + +/* + * write chunk(s), encoded as a counted array. 
+ */ +struct rpcrdma_write_array { + uint32_t wc_discrim;/* 1 indicates presence */ + uint32_t wc_nchunks;/* Array count */ + struct rpcrdma_write_chunk wc_array[0]; +}; + +struct rpcrdma_msg { + uint32_t rm_xid;/* Mirrors the RPC header xid */ + uint32_t rm_vers; /* Version of this protocol */ + uint32_t rm_credit; /* Buffers requested/granted */ + uint32_t rm_type; /* Type of message (enum rpcrdma_proc) */ + union { + + struct {/* no chunks */ + uint32_t rm_empty[3]; /* 3 empty chunk lists */ + } rm_nochunks; + + struct {/* no chunks and padded */ + uint32_t rm_align; /* Padding alignment */ + uint32_t rm_thresh; /* Padding threshold */ + uint32_t rm_pempty[3]; /* 3 empty chunk lists */ + } rm_padded; + + uint32_t rm_chunks[0]; /* read, write and reply chunks */ + + } rm_body; +}; + +#define RPCRDMA_HDRLEN_MIN 28 + +enum rpcrdma_errcode { + ERR_VERS = 1, + ERR_CHUNK = 2 +}; + +struct rpcrdma_err_vers { + uint32_t rdma_vers_low; /* Version range supported by peer */ + uint32_t rdma_vers_high; +}; + +enum rpcrdma_proc { + RDMA_MSG = 0, /* An RPC call or reply msg */ + RDMA_NOMSG = 1, /* An RPC call or reply msg - separate body */ + RDMA_MSGP = 2, /* An RPC call or reply msg with padding */ + RDMA_DONE = 3, /* Client signals reply completion */ + RDMA_ERROR = 4 /* An RPC RDMA encoding error */ +}; + +#endif /* _LINUX_SUNRPC_RPC_RDMA_H */
[ofa-general] [RFC,PATCH 03/10] rdma: SVCRMDA Header File
This file defines the data types used by the SVCRDMA transport module. The principal data structure is the transport-specific extension to the svcxprt structure. Signed-off-by: Tom Tucker [EMAIL PROTECTED] --- include/linux/sunrpc/svc_rdma.h | 261 +++ 1 files changed, 261 insertions(+), 0 deletions(-) diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h new file mode 100644 index 000..0bad94b --- /dev/null +++ b/include/linux/sunrpc/svc_rdma.h @@ -0,0 +1,261 @@ +/* + * Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the BSD-type + * license below: + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials provided + * with the distribution. + * + * Neither the name of the Network Appliance, Inc. nor the names of + * its contributors may be used to endorse or promote products + * derived from this software without specific prior written + * permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * Author: Tom Tucker [EMAIL PROTECTED] + */ + +#ifndef SVC_RDMA_H +#define SVC_RDMA_H +#include linux/sunrpc/xdr.h +#include linux/sunrpc/svcsock.h +#include linux/sunrpc/rpc_rdma.h +#include rdma/ib_verbs.h +#include rdma/rdma_cm.h +#define SVCRDMA_DEBUG + +/* RPC/RDMA parameters */ +extern unsigned int svcrdma_ord; +extern unsigned int svcrdma_max_requests; +extern unsigned int svcrdma_max_req_size; +extern unsigned int rdma_stat_recv; +extern unsigned int rdma_stat_read; +extern unsigned int rdma_stat_write; +extern unsigned int rdma_stat_sq_starve; +extern unsigned int rdma_stat_rq_starve; +extern unsigned int rdma_stat_rq_poll; +extern unsigned int rdma_stat_rq_prod; +extern unsigned int rdma_stat_sq_poll; +extern unsigned int rdma_stat_sq_prod; + +#define RPCRDMA_VERSION 1 + +/* + * Contexts are built when an RDMA request is created and are a + * record of the resources that can be recovered when the request + * completes. 
+ */ +struct svc_rdma_op_ctxt { + struct svc_rdma_op_ctxt *next; + struct xdr_buf arg; + struct list_head dto_q; + enum ib_wr_opcode wr_op; + enum ib_wc_status wc_status; + u32 byte_len; + struct svcxprt_rdma *xprt; + unsigned long flags; + enum dma_data_direction direction; + int count; + struct ib_sge sge[RPCSVC_MAXPAGES]; + struct page *pages[RPCSVC_MAXPAGES]; +}; + +#define RDMACTXT_F_READ_DONE 1 +#define RDMACTXT_F_LAST_CTXT 2 + +struct svc_rdma_deferred_req { + struct svc_deferred_req req; + struct page *arg_page; + int arg_len; +}; + +struct svcxprt_rdma { + struct svc_sock sc_xprt; /* SVC transport structure */ + struct rdma_cm_id*sc_cm_id;/* RDMA connection id */ + struct list_head sc_accept_q; /* Conn. waiting accept */ + int sc_ord; /* RDMA read limit */ + wait_queue_head_tsc_read_wait; + int sc_max_sge; + + int sc_sq_depth; /* Depth of SQ */ + atomic_t sc_sq_count; /* Number of SQ WR on queue */ + + int sc_max_requests; /* Depth of RQ */ + int sc_max_req_size; /* Size of each RQ WR buf */ + + struct ib_pd *sc_pd; + + struct svc_rdma_op_ctxt *sc_ctxt_head; + int sc_ctxt_cnt; + int sc_ctxt_bump; + int sc_ctxt_max
[ofa-general] [RFC,PATCH 04/10] rdma: SVCRDMA Transport Module
This file implements the RDMA transport module initialization and termination logic and registers the transport sysctl variables. Signed-off-by: Tom Tucker [EMAIL PROTECTED] --- net/sunrpc/svc_rdma.c | 270 + 1 files changed, 270 insertions(+), 0 deletions(-) diff --git a/net/sunrpc/svc_rdma.c b/net/sunrpc/svc_rdma.c new file mode 100644 index 000..620249d --- /dev/null +++ b/net/sunrpc/svc_rdma.c @@ -0,0 +1,270 @@ +/* + * Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the BSD-type + * license below: + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials provided + * with the distribution. + * + * Neither the name of the Network Appliance, Inc. nor the names of + * its contributors may be used to endorse or promote products + * derived from this software without specific prior written + * permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * Author: Tom Tucker [EMAIL PROTECTED] + */ +#include linux/module.h +#include linux/init.h +#include linux/fs.h +#include linux/sysctl.h +#include linux/sunrpc/clnt.h +#include linux/sunrpc/sched.h +#include linux/sunrpc/svc_rdma.h + +#define RPCDBG_FACILITYRPCDBG_SVCXPRT + +/* RPC/RDMA parameters */ +unsigned int svcrdma_ord = RPCRDMA_ORD; +static unsigned int min_ord = 1; +static unsigned int max_ord = 4096; +unsigned int svcrdma_max_requests = RPCRDMA_MAX_REQUESTS; +static unsigned int min_max_requests = 4; +static unsigned int max_max_requests = 16384; +unsigned int svcrdma_max_req_size = RPCRDMA_MAX_REQ_SIZE; +static unsigned int min_max_inline = 4096; +static unsigned int max_max_inline = 65536; +static unsigned int zero = 0; +static unsigned int one = 1; + +unsigned int rdma_stat_recv = 0; +unsigned int rdma_stat_read = 0; +unsigned int rdma_stat_write = 0; +unsigned int rdma_stat_sq_starve = 0; +unsigned int rdma_stat_rq_starve = 0; +unsigned int rdma_stat_rq_poll = 0; +unsigned int rdma_stat_rq_prod = 0; +unsigned int rdma_stat_sq_poll = 0; +unsigned int rdma_stat_sq_prod = 0; + +extern struct svc_xprt svc_rdma_xprt; + +static struct ctl_table_header *svcrdma_table_header; +static ctl_table svcrdma_parm_table[] = { + { + .ctl_name = CTL_RDMA_MAX_REQUESTS, + .procname = max_requests, + .data = svcrdma_max_requests, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .strategy = 
sysctl_intvec, + .extra1 = min_max_requests, + .extra2 = max_max_requests + }, + { + .ctl_name = CTL_RDMA_MAX_REQ_SIZE, + .procname = max_req_size, + .data = svcrdma_max_req_size, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .strategy = sysctl_intvec, + .extra1 = min_max_inline, + .extra2 = max_max_inline + }, + { + .ctl_name = CTL_RDMA_ORD, + .procname = max_outbound_read_requests, + .data = svcrdma_ord, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax
[ofa-general] [RFC, PATCH 05/10] rdma: SVCRDMA Core Transport Services
This file implements the core transport data management and I/O path. The I/O path for RDMA involves receiving callbacks in interrupt context. Since all of the svc transport locks are _bh locks, we enqueue the transport on a list and schedule a tasklet to dequeue data indications from the RDMA completion queue. The tasklet in turn takes _bh locks to enqueue receive data indications on a list for the transport. The svc_rdma_recvfrom transport function dequeues data from this list in an NFSD thread context. Signed-off-by: Tom Tucker [EMAIL PROTECTED] --- net/sunrpc/svc_rdma_transport.c | 1207 +++ net/sunrpc/svcauth_unix.c |3 2 files changed, 1208 insertions(+), 2 deletions(-) diff --git a/net/sunrpc/svc_rdma_transport.c b/net/sunrpc/svc_rdma_transport.c new file mode 100644 index 000..3f1f251 --- /dev/null +++ b/net/sunrpc/svc_rdma_transport.c @@ -0,0 +1,1207 @@ +/* + * Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the BSD-type + * license below: + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials provided + * with the distribution. + * + * Neither the name of the Network Appliance, Inc. nor the names of + * its contributors may be used to endorse or promote products + * derived from this software without specific prior written + * permission.
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + * Author: Tom Tucker [EMAIL PROTECTED] + */ + +#include asm/semaphore.h +#include linux/device.h +#include linux/in.h +#include linux/err.h +#include linux/time.h +#include linux/delay.h + +#include linux/sunrpc/svcsock.h +#include linux/sunrpc/debug.h +#include linux/sunrpc/rpc_rdma.h +#include linux/mm.h /* num_physpages */ +#include linux/spinlock.h +#include linux/net.h +#include net/sock.h +#include asm/io.h +#include rdma/ib_verbs.h +#include rdma/rdma_cm.h +#include net/ipv6.h +#include linux/sunrpc/svc_rdma.h + +#define RPCDBG_FACILITYRPCDBG_SVCXPRT + +int svc_rdma_create_svc(struct svc_serv *serv, struct sockaddr *sa, int flags); +static int svc_rdma_accept(struct svc_sock *xprt); +static void rdma_destroy_xprt(struct svcxprt_rdma *xprt); +static void dto_tasklet_func(unsigned long data); +static struct cache_deferred_req *svc_rdma_defer(struct cache_req *req); +static void svc_rdma_detach(struct svc_sock *svsk); +static void svc_rdma_free(struct svc_sock *svsk); +static int svc_rdma_has_wspace(struct svc_sock *svsk); +static int svc_rdma_get_name(char *buf, struct svc_sock *svsk); + +static void rq_cq_reap(struct svcxprt_rdma *xprt); +static void sq_cq_reap(struct 
svcxprt_rdma *xprt); + +DECLARE_TASKLET(dto_tasklet, dto_tasklet_func, 0UL); +static spinlock_t dto_lock = SPIN_LOCK_UNLOCKED; +static LIST_HEAD(dto_xprt_q); + +struct svc_xprt svc_rdma_xprt = { + .xpt_name = rdma, + .xpt_owner = THIS_MODULE, + .xpt_create_svc = svc_rdma_create_svc, + .xpt_get_name = svc_rdma_get_name, + .xpt_recvfrom = svc_rdma_recvfrom, + .xpt_sendto = svc_rdma_sendto, + .xpt_detach = svc_rdma_detach, + .xpt_free = svc_rdma_free, + .xpt_has_wspace = svc_rdma_has_wspace, + .xpt_max_payload = RPCSVC_MAXPAYLOAD_TCP, + .xpt_accept = svc_rdma_accept, + .xpt_defer = svc_rdma_defer +}; + +static int rdma_bump_context_cache(struct svcxprt_rdma *xprt) +{ + int target; + int at_least_one = 0; + struct svc_rdma_op_ctxt *ctxt; + + target = min(xprt-sc_ctxt_cnt + xprt-sc_ctxt_bump
[ofa-general] [RFC,PATCH 06/10] rdma: SVCRDMA recvfrom
This file implements the RDMA transport recvfrom function. The function dequeues work request completion contexts from an I/O list that it shares with the I/O tasklet in svc_rdma_transport.c. For ONCRPC RDMA, an RPC may not be complete when it is received. Instead, the RDMA header that precedes the RPC message informs the transport where to get the RPC data from on the client and where to place it in the RPC message before it is delivered to the server. The svc_rdma_recvfrom function therefore parses this RDMA header and issues any necessary RDMA operations to fetch the remainder of the RPC from the client. Special handling is required when the request involves an RDMA_READ; in this case, recvfrom submits all of the RDMA_READ requests to the underlying transport driver and then returns 0 (EAGAIN). When the transport completes the last RDMA_READ for the request, it enqueues the request on a read completion queue and enqueues the transport. The recvfrom code favors this queue over the regular DTO queue when satisfying reads. Signed-off-by: Tom Tucker [EMAIL PROTECTED] --- net/sunrpc/svc_rdma_recvfrom.c | 664 1 files changed, 664 insertions(+), 0 deletions(-) diff --git a/net/sunrpc/svc_rdma_recvfrom.c b/net/sunrpc/svc_rdma_recvfrom.c new file mode 100644 index 000..681f25a --- /dev/null +++ b/net/sunrpc/svc_rdma_recvfrom.c @@ -0,0 +1,664 @@ +/* + * Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the BSD-type + * license below: + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * + * Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials provided + * with the distribution. + * + * Neither the name of the Network Appliance, Inc. nor the names of + * its contributors may be used to endorse or promote products + * derived from this software without specific prior written + * permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ * + * Author: Tom Tucker [EMAIL PROTECTED] + */ + +#include asm/semaphore.h +#include linux/device.h +#include linux/in.h +#include linux/err.h +#include linux/time.h + +#include linux/sunrpc/svcsock.h +#include linux/sunrpc/debug.h +#include linux/sunrpc/rpc_rdma.h +#include linux/mm.h /* num_physpages */ +#include linux/spinlock.h +#include linux/net.h +#include net/sock.h +#include asm/io.h +#include asm/unaligned.h +#include rdma/ib_verbs.h +#include rdma/rdma_cm.h +#include linux/sunrpc/svc_rdma.h + +#define RPCDBG_FACILITYRPCDBG_SVCXPRT + +/* + * Replace the pages in the rq_argpages array with the pages from the SGE in + * the RDMA_RECV completion. The SGL should contain full pages up until the + * last one. + */ +static void rdma_build_arg_xdr(struct svc_rqst *rqstp, + struct svc_rdma_op_ctxt *ctxt, + u32 byte_count) +{ + struct page *page; + u32 bc; + int sge_no; + + /* Swap the page in the SGE with the page in argpages */ + page = ctxt-pages[0]; + put_page(rqstp-rq_pages[0]); + rqstp-rq_pages[0] = page; + + /* Set up the XDR head */ + rqstp-rq_arg.head[0].iov_base = page_address(page); + rqstp-rq_arg.head[0].iov_len = min(byte_count, ctxt-sge[0].length); + rqstp-rq_arg.len = byte_count; + rqstp-rq_arg.buflen = byte_count; + + /* Compute bytes past head in the SGL */ + bc = byte_count - rqstp-rq_arg.head[0].iov_len; + + /* If data remains, store it in the pagelist */ + rqstp-rq_arg.page_len = bc; + rqstp-rq_arg.page_base = 0
[ofa-general] [RFC,PATCH 07/10] rdma: SVCRDMA sendto
This file implements the RDMA transport sendto function. An RPC reply on an RDMA transport consists of some number of RDMA_WRITE requests followed by an RDMA_SEND request. The sendto function parses the ONCRPC RDMA reply header to determine how to send the reply back to the client. The send queue is sized so as to be able to send complete replies for requests in most cases. In the event that there are not enough SQ WR slots to reply, e.g. for big data, the send will block the NFSD thread. The I/O callback functions in svc_rdma_transport.c that reap WR completions wake any waiters blocked on the SQ. In general, the goal is not to block NFSD threads; the has_wspace method stalls requests when the SQ is nearly full.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---
 net/sunrpc/svc_rdma_sendto.c | 515 ++
 1 files changed, 515 insertions(+), 0 deletions(-)

diff --git a/net/sunrpc/svc_rdma_sendto.c b/net/sunrpc/svc_rdma_sendto.c
new file mode 100644
index 000..cd4b5ac
--- /dev/null
+++ b/net/sunrpc/svc_rdma_sendto.c
@@ -0,0 +1,515 @@
+/*
+ * Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the BSD-type
+ * license below:
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials provided
+ * with the distribution.
+ *
+ * Neither the name of the Network Appliance, Inc.
nor the names of
+ * its contributors may be used to endorse or promote products
+ * derived from this software without specific prior written
+ * permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * Author: Tom Tucker [EMAIL PROTECTED]
+ */
+
+#include <asm/semaphore.h>
+#include <linux/device.h>
+#include <linux/in.h>
+#include <linux/err.h>
+#include <linux/time.h>
+
+#include <linux/sunrpc/svcsock.h>
+#include <linux/sunrpc/debug.h>
+#include <linux/sunrpc/rpc_rdma.h>
+#include <linux/mm.h>		/* num_physpages */
+#include <linux/spinlock.h>
+#include <linux/net.h>
+#include <net/sock.h>
+#include <asm/io.h>
+#include <asm/unaligned.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/rdma_cm.h>
+#include <linux/sunrpc/svc_rdma.h>
+
+#define RPCDBG_FACILITY	RPCDBG_SVCXPRT
+
+/* Encode an XDR as an array of IB SGE
+ *
+ * Assumptions:
+ * - head[0] is physically contiguous.
+ * - tail[0] is physically contiguous.
+ * - pages[] is not physically or virtually contiguous and consists of
+ *   PAGE_SIZE elements.
+ *
+ * Output:
+ * SGE[0]              reserved for RPCRDMA header
+ * SGE[1]              data from xdr->head[]
+ * SGE[2..sge_count-2] data from xdr->pages[]
+ * SGE[sge_count-1]    data from xdr->tail.
+ *
+ */
+static struct ib_sge *xdr_to_sge(struct svcxprt_rdma *xprt,
+				 struct xdr_buf *xdr,
+				 struct ib_sge *sge,
+				 int *sge_count)
+{
+	/* Max we need is the length of the XDR / pagesize + one for
+	 * head + one for tail + one for RPCRDMA header
+	 */
+	int sge_max = (xdr->len + PAGE_SIZE - 1) / PAGE_SIZE + 3;
+	int sge_no;
+	u32 byte_count = xdr->len;
+	u32 sge_bytes;
+	u32 page_bytes;
+	int page_off;
+	int page_no;
+
+	/* Skip the first sge, this is for the RPCRDMA header */
+	sge_no = 1;
+
+	/* Head SGE */
+	sge[sge_no].addr = ib_dma_map_single(xprt->sc_cm_id->device,
+					     xdr->head[0].iov_base,
+					     xdr->head[0].iov_len
[ofa-general] [RFC, PATCH 08/10] rdma: ONCRPC RDMA protocol marshalling
This logic parses the ONCRPC RDMA protocol headers that precede the actual RPC header. It is placed in a separate file to keep all protocol aware code in a single place.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---
 net/sunrpc/svc_rdma_marshal.c | 424 +
 1 files changed, 424 insertions(+), 0 deletions(-)

diff --git a/net/sunrpc/svc_rdma_marshal.c b/net/sunrpc/svc_rdma_marshal.c
new file mode 100644
index 000..feebabd
--- /dev/null
+++ b/net/sunrpc/svc_rdma_marshal.c
@@ -0,0 +1,424 @@
+/*
+ * Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the BSD-type
+ * license below:
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials provided
+ * with the distribution.
+ *
+ * Neither the name of the Network Appliance, Inc. nor the names of
+ * its contributors may be used to endorse or promote products
+ * derived from this software without specific prior written
+ * permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * Author: Tom Tucker [EMAIL PROTECTED]
+ */
+
+#include <asm/semaphore.h>
+#include <linux/device.h>
+#include <linux/in.h>
+#include <linux/err.h>
+#include <linux/time.h>
+
+#include <rdma/rdma_cm.h>
+
+#include <linux/sunrpc/svcsock.h>
+#include <linux/sunrpc/debug.h>
+#include <linux/sunrpc/rpc_rdma.h>
+#include <linux/spinlock.h>
+#include <linux/net.h>
+#include <net/sock.h>
+#include <asm/io.h>
+#include <asm/unaligned.h>
+#include <rdma/rdma_cm.h>
+#include <rdma/ib_verbs.h>
+#include <linux/sunrpc/rpc_rdma.h>
+#include <linux/sunrpc/svc_rdma.h>
+
+#define RPCDBG_FACILITY	RPCDBG_SVCXPRT
+
+/*
+ * Decodes a read chunk list. The expected format is as follows:
+ *    discrim  : xdr_one
+ *    position : u32 offset into XDR stream
+ *    handle   : u32 RKEY
+ *    . . .
+ *    end-of-list: xdr_zero
+ */
+static u32 *decode_read_list(u32 *va, u32 *vaend)
+{
+	struct rpcrdma_read_chunk *ch = (struct rpcrdma_read_chunk *)va;
+
+	while (ch->rc_discrim != xdr_zero) {
+		u64 ch_offset;
+
+		if (((unsigned long)ch + sizeof(struct rpcrdma_read_chunk)) >
+		    (unsigned long)vaend) {
+			dprintk("svcrdma: vaend=%p, ch=%p\n", vaend, ch);
+			return NULL;
+		}
+
+		ch->rc_discrim = ntohl(ch->rc_discrim);
+		ch->rc_position = ntohl(ch->rc_position);
+		ch->rc_target.rs_handle = ntohl(ch->rc_target.rs_handle);
+		ch->rc_target.rs_length = ntohl(ch->rc_target.rs_length);
+		va = (u32 *)&ch->rc_target.rs_offset;
+		xdr_decode_hyper(va, &ch_offset);
+		put_unaligned(ch_offset, (u64 *)va);
+		ch++;
+	}
+	return (u32 *)&ch->rc_position;
+}
+
+/*
+ * Determine number of chunks and total bytes in chunk list. The chunk
+ * list has already been verified to fit within the RPCRDMA header.
+ */
+void svc_rdma_rcl_chunk_counts(struct rpcrdma_read_chunk *ch,
+			       int *ch_count, int *byte_count)
+{
+	/* compute the number of bytes represented by read chunks */
+	*byte_count = 0;
+	*ch_count = 0;
+	for (; ch->rc_discrim != 0; ch++) {
+		*byte_count = *byte_count + ch->rc_target.rs_length;
+		*ch_count = *ch_count + 1;
+	}
+}
+
+/*
+ * Decodes a write chunk list. The expected format is as follows:
+ *    discrim : xdr_one
+ *    nchunks : count
+ *       handle : u32 RKEY ---+
+ *       length : u32 len
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
On Wed, 2007-08-15 at 22:26 -0400, Jeff Garzik wrote: [...snip...] I think removing the RDMA stack is the wrong thing to do, and you shouldn't just threaten to yank entire subsystems because you don't like the technology. Let's keep this constructive, can we? RDMA should get the respect of any other technology in Linux. Maybe it's a niche in your opinion, but come on, there are more RDMA users than, say, the sparc64 port. Eh? It's not about being a niche. It's about creating a maintainable software net stack that has predictable behavior. Isn't RDMA _part_ of the software net stack within Linux? Why isn't making RDMA stable, supportable and maintainable equally as important as any other subsystem? Needing to reach out of the RDMA sandbox and reserve net stack resources away from itself travels a path we've consistently avoided. I will NACK any patch that opens up sockets to eat up ports or anything stupid like that. Got it. Ditto for me as well. Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [ANNOUNCE] NFS-RDMA for OFED 1.2 G/A
For those interested in NFS-RDMA, OGC has created an install package based on the OFA 1.2 GA release. The package supports both SLES 10 and RHEL 5. You can download this package from http://www.opengridcomputing.com/nfs-rdma.html. Please let me know if you find any problems. Thanks, Tom ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] [PATCH] amso1100: QP init bug in amso driver
Roland: The guys at UNH found this and fixed it. I'm surprised no one has hit this before. I guess it only breaks when the refcount on the QP is non-zero. Initialize the wait_queue_head_t in the c2_qp structure.

Signed-off-by: Ethan Burns [EMAIL PROTECTED]
Acked-by: Tom Tucker [EMAIL PROTECTED]
---
 drivers/infiniband/hw/amso1100/c2_qp.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2_qp.c b/drivers/infiniband/hw/amso1100/c2_qp.c
index 420c138..01d0786 100644
--- a/drivers/infiniband/hw/amso1100/c2_qp.c
+++ b/drivers/infiniband/hw/amso1100/c2_qp.c
@@ -506,6 +506,7 @@ int c2_alloc_qp(struct c2_dev *c2dev,
 	qp->send_sgl_depth = qp_attrs->cap.max_send_sge;
 	qp->rdma_write_sgl_depth = qp_attrs->cap.max_send_sge;
 	qp->recv_sgl_depth = qp_attrs->cap.max_recv_sge;
+	init_waitqueue_head(&qp->wait);
 	/* Initialize the SQ MQ */
 	q_size = be32_to_cpu(reply->sq_depth);

___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] OFED July 2, meeting summary on next OFED plans
Roland: On Tue, 2007-07-03 at 15:35 -0500, Steve Wise wrote: Tom, can you update us on NFS-RDMA? Roland Dreier wrote: NFSoRDMA integration. I would like to see a status report on NFS/RDMA from the people who want it in OFED. As I understand it there are many core kernel changes required for this -- switchable transports and also mount option changes? You are correct about the scope of the changes, although many of them are already in the kernel. Chuck Lever just posted the mount changes and I have posted a second round of the NFS-RDMA patches. You can see these on [EMAIL PROTECTED] I would like to get them upstream in 2.6.23, but that's probably optimistic. As far as I can tell from the outside, the NFS/RDMA effort seems to have stalled -- whenever I talk to core NFS developers like Chuck Lever or Trond Myklebust, they say that they are just waiting for the NFS/RDMA developers to submit their changes for review. And I haven't seen any patches for a kernel newer than 2.6.18, so things look quite out-of-date. I'm not sure when you talked to those guys, but as I mentioned, this is round two of the patch submission. There is also a git tree that has these submitted patches available for download and testing. These are on a 2.6.22-rc6 base and the git URL is git://linux-nfs.org/~tomtucker/nfs-rdma-dev-2.6.git If you like, I can post the patchset here as well. Without visible progress towards getting NFS/RDMA into mergeable form soon, I think putting it into OFED 1.3 as anything other than a technology preview that may be dropped from future releases would be a very risky thing to do. Otherwise OFED risks getting stuck maintaining the whole NFS/RDMA stack, since the development effort outside of OFED really looks to me like it is fizzling out. Perhaps the activity is not where you're used to looking. Both Trond and Neal reviewed the previous patchset and provided feedback that I addressed in the most recent patchset.
That said, I'm sure there will be quite a bit more before it's mergeable. - R. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: Incorrect max_sge reported in mthca device query
On Mon, 2007-04-02 at 09:08 +0300, Michael S. Tsirkin wrote: On Sun, 2007-04-01 at 09:43 +0300, Michael S. Tsirkin wrote: [...snip...] I think that if we extend the API, we need to design it carefully to cover as many use cases as possible. Tom, could you explain what you are trying to do? Why does your application need as many SGEs as possible? Mike: The application is NFS-RDMA. NFS keeps its data as non-contiguous arrays of pages. So the motivation is that having a larger SGL allows you to support larger data transfers with a single operation. The challenge with the current query/request method is that, as we've discussed, the advertised max may not work. What makes the adjust/retry unworkable is that you don't know which of the advertised maxes caused the request to fail. So when you retry, which qp_attr do you adjust? The send sge? The recv sge? The qp depth? So what I'm proposing, and I think is similar if not identical to what other folks have talked about, is having an interface that treats the qp_attr values as requested sizes that can be adjusted by the provider. So for example, if I ask for a send_sge of 30, but you can only do 28, you give me 28 and adjust the qp_attr structure so that I know what I got. This would allow me to perform a predictable sequence of 1. query, 2. request, 3. adjust in my code. BTW, I think it needs to be a new provider method to be done efficiently. Also, what's a good name, ib_request_qp? Thanks, Tom Also - what about out of resources cases described above? Would you expect the verbs API to retry the request for you? ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] Re: Incorrect max_sge reported in mthca device query
On Thu, 2007-04-05 at 09:27 -0700, Sean Hefty wrote: The challenge with the current query/request method is that as we've discussed the advertised max may not work. What makes the adjust/retry unworkable is that you don't know which of the advertised maxes caused the request to fail. So when you retry, which qp_attr do you adjust? The send sge? The recv sge? The qp depth? So what I'm proposing, and I think is similar if not identical to what other folks have talked about is having an interface that treats the qp_attr values as requested-sizes that can be adjusted by the provider. So for example, if I ask for a send_sge of 30, but you can only do 28, you give me 28 and adjust the qp_attr structure so that I know what I got. This would allow me to perform a predictable sequence of 1. query, 2. request, 3. adjust in my code. If the send sge/recv sge/qp depth/etc. aren't independent though, this pushes the problem and policy decision down to the provider. I can't think of an easy solution to this. Agreed. But practically I think they are. I think the SGE max is driven off the max size of a WR and type of QP. This is true of the iWARP adapters as well. But taking the bait...even if you didn't push it down to the provider, how do you expose the inter-relationships to the consumer? An approach in this vein is a could_you_would_you/why_not interface that would return whether or not the specified qp_attr would work and if it didn't some indication of which resource(s) caused the problem. The problems there are a) the resource may be gone when you go back with what you just had approved, and b) you still have to fuss with multiple whacks at it if you couldn't get what you asked for. I think something simpler, although arguably not perfect is the way to go. Tom - Sean ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[ofa-general] Re: Incorrect max_sge reported in mthca device query
Michael: Thanks for the detailed reply. How about if we added an interface that would treat the SGE counts/WR counts as requests and then update the qp_init_attr struct with what was actually created? That would allow the app to request the max, but settle for what the device was capable of at the time. On Sun, 2007-04-01 at 09:43 +0300, Michael S. Tsirkin wrote: Quoting Tom Tucker [EMAIL PROTECTED]: Subject: Incorrect max_sge reported in mthca device query Roland: I think the max_sge reported by mthca_query_device is off by one. If you try to create a QP with the reported max, it fails with -EINVAL. I think the reason is that the mthca_alloc_wqe_buf function reserves a slot for a bind request and this pushes the WQE size over the 496B limit when the user requests the max (30) when allocating the QP. Please let me know if I'm confused about what max_sge really means. Thanks, Tom Tom, max_sge reported by mthca_query_device is the upper bound for all QP types. I have not tested this, but think you can create a UD type QP with this number of SGEs. I'd like to add that there can be no hard guarantee that creating a QP with a specific set of max_sge/max_wr always succeeds even if it is within the range of values reported by mthca_query_device: for example, for userspace QPs, the system administrator might have limited the amount of memory that can be locked up by these QPs, and QP allocation requests with large max_sge/max_wr values will always fail. There are other examples of this. Thus, an application that wants to use as large a number of SGEs/WRs as possible in a robust fashion currently has no other choice except a trial and error approach, handling failures gracefully. Finally, as a side note, it is *also* inefficient to request allocation of more sge entries than the ULP will typically use - for reasons such as cache utilization, and many others.
How this overhead trades off against the ULP's need to sometimes post multiple WRs will depend on both the ULP and the hardware used. This need to tune the ULP to a specific HCA is annoying, and might be something that we want to try and solve at the API level. However, max_sge/max_wr values in query device are unlikely to be the appropriate API for this. One way out could be to extend the API for create_qp and friends, passing in both min and max values for some parameters, and allowing the verbs provider to choose the optimal combination of these. I think I floated a similar proposal once already, but there didn't appear to be sufficient user support for such a large API extension. ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general