RE: [ofa-general] RE: [Bug 396] OFED 1.2 alpha DAPL failures using IntelMPI 3.0.33, kernel patching issues

2007-02-28 Thread Sean Hefty
This is the result of incorrect timeout values being used as a result of sean_cm_limit_mra_timeout_patch. Can someone tell me the purpose of this patch and how it became part of the OFED 1.2 build? This patch sets the timeout values incorrectly and needs to be removed from OFED. The purpose was

Re: [ofa-general] [PATCH, RFC] libibverbs: Add hooks for rereg_mr, memory windows

2007-03-01 Thread Sean Hefty
So does anyone see anything obviously missing or wrong here? It looks fine to me with one minor comment. +struct ibv_mw_bind { + struct ibv_mr *mr; + uint64_t wr_id; + uint64_t addr; + uint64_t length; The memory

RE: [ofa-general] librdmacm build failure

2007-03-01 Thread Sean Hefty
Does your configure.in need something like this (cut from libibverbs configure.in)? I'm not an automake or configure whiz at all, so I'm just guessing. But libibverbs builds fine and it uses a version script too... It may... I tried to match the libibverbs build settings, so this could have been

Re: [ofa-general] rdma_cm issues in 2.6.21-rc1

2007-03-05 Thread Sean Hefty
Sean Hefty wrote: As just a note, I'm investigating two issues with the rdma_cm and 2.6.21-rc1. Running ucmatose twice results in a failure binding to an address the second time that it's run. I'm also seeing a kernel crash if ucmatose is killed while waiting for a connection. Just an update

Re: [ofa-general] Re: [openib-general] [PATCH] 2.6.20 ib_cm: limit cm message timeouts

2007-03-06 Thread Sean Hefty
+#define DRV_NAME "ib_cm" +#define PFX DRV_NAME ": " Just define PFX. + +/* + * Limit CM msg timeouts to something reasonable. + * 8 seconds, with up to 15 retries, gives per msg timeout of 2 min. + */ +#define IB_CM_MAX_TIMEOUT 21 Thinking out loud... maybe we should make
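The arithmetic in the quoted comment can be checked with a short sketch. It assumes the usual InfiniBand timeout encoding, in which a field value n stands for 4.096 µs × 2^n. The function names are illustrative and do not come from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: an IB timeout field encodes 4.096 us * 2^exponent. */
uint64_t cm_timeout_ns(uint8_t exponent)
{
    assert(exponent <= 31);            /* the field is 5 bits wide */
    return UINT64_C(4096) << exponent; /* 4.096 us == 4096 ns */
}

/* Worst case for one message: the first send and every retry all time out. */
uint64_t cm_worst_case_ns(uint8_t exponent, unsigned retries)
{
    return cm_timeout_ns(exponent) * ((uint64_t)retries + 1);
}
```

With an exponent of 21 this gives about 8.6 seconds per attempt. Sixteen attempts (one send plus 15 retries) come to roughly 137 seconds, which is consistent with the "2 min" in the comment.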

[ofa-general] [PATCH] ib_sa: set src_path_bits correctly in ib_init_ah_from_path

2007-03-07 Thread Sean Hefty
The src_path_bits needs to mask off the base LID value. Signed-off-by: Sean Hefty [EMAIL PROTECTED] --- Here's a first cut at setting the src_path_bits correctly. I don't think that this is important enough to push for 2.6.21, but if it looks okay I can queue it until 2.6.22 starts up. diff
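As a rough illustration of "masking off the base LID", the sketch below keeps only the low LMC bits of a source LID. The helper name and signature are hypothetical. The actual patch body is not shown in this excerpt, so this is not a reproduction of it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: a port with LID mask control `lmc` owns 2^lmc
 * consecutive LIDs, so the path bits are the low `lmc` bits of the LID
 * and everything above them belongs to the base LID.
 */
uint8_t src_path_bits(uint16_t slid, uint8_t lmc)
{
    assert(lmc <= 7);                        /* LMC is a 3-bit field */
    return (uint8_t)(slid & ((1u << lmc) - 1u));
}
```

For example, with an LMC of 2 and a source LID of 0x13, the base LID is 0x10 and the path bits are 3.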

[ofa-general] RE: [ewg] Re: OFED 1.2 beta blocking bugs

2007-03-08 Thread Sean Hefty
Not sure what you're asking, but just to be clear, this IPoIB HA is entirely in userspace (it's a crazy perl script that ups and downs ports in response to various events). Thanks - this helps. From a quick look at the code, it does look like there are some races in ipoib_multicast.c. The place

[ofa-general] [PATCH]] ucma backport to 2.6.19

2007-03-08 Thread Sean Hefty
The prototype for show_abi_version changed between 2.6.20 and 2.6.19; this was the missing piece in the original backport patch. I would have expected a build warning for this. Signed-off-by: Sean Hefty [EMAIL PROTECTED] --- --- ofa_kernel-1.2/drivers/infiniband/core/ucma.c 2007-03-08 12:11

RE: [ofa-general] RE: OFED 1.2 beta blocking bugs

2007-03-08 Thread Sean Hefty
There are also IPoIB CM IP multicast problems, see bug 418. Bug 418 looks different than bug 400. From the bug report, it sounds like this error is limited to IPoIB CM mode. Is this correct? If you try to multicast packets 2KB, you see: I'm not sure if your hardware supports a max MTU of 2K,

Re: [ofa-general] RE: OFED 1.2 beta blocking bugs

2007-03-08 Thread Sean Hefty
Scott Weitzenkamp (sweitzen) wrote: Yes, it's limited to IPoIB CM. I'm talking about IP multicast not IB multicast, so the hardware MTU should be transparent. I'll reopen bug 418, who shall I assign it to? I think Michael owns the IPoIB CM code, so it should probably be assigned to him. I

[ofa-general] [PATCH] ib_ipoib: fix race detaching from mcast group before attaching

2007-03-08 Thread Sean Hefty
I believe this is a simple fix for the detach before attach race that Roland pointed out. I only did some limited testing on my systems, so I can't say that it will fully fix bug report 400. Roland, if this looks good to you, let me know and I can push it out to my git tree. Signed-off-by: Sean

Re: [ofa-general] RFC: pull from 2.6.21

2007-03-12 Thread Sean Hefty
1. merged_sean_rdma_dev_ofed_1_2.patch - I think all multicast bits are merged in 2.6.21-rc3 so we only have to take code from local_sa branch now. Right? Correct - though I would need to update my branches to 2.6.21-rc3 first, which I will do today. - Sean

Re: [ofa-general] RE: bug 400: ipoib error messages

2007-03-14 Thread Sean Hefty
Right now I am having a hard time getting failures to happen, I'll keep trying. Your last report mentioned that you were running OFED-1.2-20070311-0600. Is this still the case? A fix for the multicast detach race went into OFED 1.2 on March 11th. I don't know if it made it into the

Re: [ofa-general] Re: [openib-general] [PATCH] 2.6.20 ib_cm: limit cm message timeouts

2007-03-14 Thread Sean Hefty
How about the attached fix to Sean patch? I've created an updated patch that I will queue for 2.6.22. I'm working on importing it into OFED 1.2, and should have that shortly. - Sean ___ general mailing list general@lists.openfabrics.org

RE: [ofa-general] OFA newbie question: module load/unload

2007-03-14 Thread Sean Hefty
What about bumping the module reference count when a kernel client (eg ipoib) is using a device? How is the driver made unloadable in such cases? The driver is still unloadable in such cases. The kernel client is notified when a device is removed and is expected to release any resources

Re: [ofa-general] Re: [GIT PULL] OFED 1.2: CM scaling fixes

2007-03-14 Thread Sean Hefty
In any case, I'm updating the patch... Something's wrong with the OFED git tree. Looking online, I don't see any changes to the git log since early February. If I clone the git tree, however, I do see recent log messages/changes. ??? I went to update the file

Re: [ofa-general] Re: [GIT PULL] OFED 1.2: CM scaling fixes

2007-03-14 Thread Sean Hefty
Vlad, please pull from: git://git.openfabrics.org/~shefty/ofed_1_2.git ofed_1_2 This should add some necessary fixes to the OFED code: RDMA/ucma: avoid sending reject if backlog is full RDMA/cma: Request reversible paths only IB/cm: remove broken MRA timeout

[ofa-general] [Bug 400] ipoib error messages

2007-03-15 Thread Sean Hefty
Scott Weitzenkamp (sweitzen) wrote: Yes, I was using 20070311, and I see the patch in 20070312. I'll try it. Scott, have you had a chance to test with 20070312, and, if so, did you see the mcast detach issue? - Sean

RE: [ofa-general] [PATCH] ib_ipoib: fix race detaching from mcast group before attaching

2007-03-19 Thread Sean Hefty
What's the theory here? It's not obvious why moving the call to ib_sa_free_multicast() fixes the race... The attach QP only occurs in the context of the multicast callback thread. ib_sa_free_multicast() blocks until the callback returns, which ensures that the detach check/call (which is now

[ofa-general] RE: [RFC] host stack IB-to-IB router support

2007-03-19 Thread Sean Hefty
Would it become part of openfabrics or just as a 3rd party patch that interested parties could apply? Portions of the changes should be suitable for upstream submission. The ib_remote_sa module could be added to an OFED release if there was enough demand. So the idea is that the CM REQ now uses

[ofa-general] RE: [RFC] host stack IB-to-IB router support

2007-03-19 Thread Sean Hefty
Hmm. If the goal is to enable router development and experimentation then it would be best if the 'ib_remote_sa' server was in user space, dealt with all 4 path records in one query and was centralized so it could be made to store routing topology and configuration to solve the multipath problems.

Re: [ofa-general] Re: [GIT PULL] OFED 1.2: CM scaling fixes

2007-03-20 Thread Sean Hefty
Sean Hefty wrote: Vlad, please pull from: git://git.openfabrics.org/~shefty/ofed_1_2.git ofed_1_2 This should add some necessary fixes to the OFED code: RDMA/ucma: avoid sending reject if backlog is full RDMA/cma: Request reversible paths only IB/cm: fix MRA timeout patch

Re: [ofa-general] [PATCH] use LIDs from REQ LRH for inter-subnet connections

2007-03-20 Thread Sean Hefty
When you get a chance, can you try out this patch? I tested that it worked for a local subnet connection by commenting out the hop_limit check. So, I'm interested to know if you run into any problems. If you do run into issues, madeye may be able to help. I've reworked this patch, and added

Re: [ofa-general] [RFC] host stack IB-to-IB router support

2007-03-21 Thread Sean Hefty
Ok, lets assume Sean would finish his experiments with remote_sa, how would that find its way into the commercial sm/sa versions that are mostly used, how would we guarantee interoperability between all implementations, .. ? How would that address future routing, security, QoS, .. enhancements ?

RE: [ofa-general] Re: [GIT PULL] OFED 1.2: CM scaling fixes

2007-03-21 Thread Sean Hefty
/infiniband/core/cm.c index 842cd0b..706fdbf 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -54,6 +54,17 @@ MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("InfiniBand CM"); MODULE_LICENSE("Dual BSD/GPL"); +#define PFX "ib_cm: " + +/* + * Limit CM message timeouts to something

RE: [ofa-general] [RFC] host stack IB-to-IB router support

2007-03-21 Thread Sean Hefty
If its not my subnet read the DGID from a table (or even a config for now) And conduct SA query on that one On the remote side, add the reverse lookup rather than use the CM REQ SLID Trying to perform SA queries inside the CM protocol/state machine on the passive side is actually fairly complex.

Re: [ewg] RE: [ofa-general] Re: [GIT PULL] OFED 1.2: CM scaling fixes

2007-03-22 Thread Sean Hefty
OK, so can you change the default to lower value in your branch? Done - set to 21 (~8 seconds)

Re: [ofa-general] Re: Re: [PATCH] IB/umad: fix GRH handling

2007-03-23 Thread Sean Hefty
Overall, looks safe. If you want the fix in OFED 1.2, file a bug in the bugzilla. I've made one adjustment to the original patch to set the hop_limit to 0xff on receives. The updated patch is in the ib_router branch of my git tree. But - is this patch going into 2.6.21? And if not, why

RE: [ofa-general] madeye kernel oops

2007-03-27 Thread Sean Hefty
How easily can you reproduce this? I'm assuming that this is with OFED 1.2 on 2.6.20, correct? Can you describe what you were doing when this crash occurred? Thanks, Sean Unable to handle kernel NULL pointer dereference at 0038 RIP: [8801021f]

RE: [ofa-general] Re: Incorrect max_sge reported in mthca device query

2007-04-05 Thread Sean Hefty
The challenge with the current query/request method is that as we've discussed the advertised max may not work. What makes the adjust/retry unworkable is that you don't know which of the advertised maxes caused the request to fail. So when you retry, which qp_attr do you adjust? The send sge? The

[ofa-general] [GIT PULL] 2.6.22: please pull rdma-dev.git

2007-04-05 Thread Sean Hefty
Roland, please review and pull patches from git.openfabrics.org/~shefty/rdma-dev.git for-roland This will pull in some patches that I would like queued for 2.6.22. Sean Hefty (6): rdma_ucm: simplify ucma_get_event code ib_ucm: simplify ib_ucm_event code ib_sa: set

[ofa-general] [GIT PULL] OFED 1.2: please pull librdmacm.git ofed_1_2

2007-04-06 Thread Sean Hefty
Vlad, Please update the ofed 1.2 librdmacm branch from git://git.openfabrics.org/~shefty/librdmacm.git ofed_1_2 This will update ofed to librdmacm 1.0-rc2. The only notable code change is a fix for bug 521, which allows 32-bit userspace to work with 64-bit kernel. - Sean

RE: [ofa-general] RNR NAK issues

2007-04-09 Thread Sean Hefty
One of the things that I discovered was that in cm.c qp_attr->min_rnr_timer was set to 0. What is the purpose of setting this to 0? How are drivers expected to use this? I see that mthca does some computation. Probably because of this (min_rnr_timer = 0) ehca appears to use this value and sets it

RE: [ofa-general] [RFC] IB management changes proposal

2007-04-10 Thread Sean Hefty
These are IB diags and not RDMA diags though. What would a better name be ? Isn't the entire management directory IB specific? If so, you could just leave it at diags. I do prefer opensm over osm though. - Sean

RE: [ofa-general] RE: How fast to get RDMA_CM_EVENT_DISCONNECTED ?

2007-04-11 Thread Sean Hefty
Several seconds to detect connection failure is not acceptable for us, so if I use rdmacm, I want to know if I detect the connection failure faster than heart-beat message. In general, use of the rdma or ib cm will not help detect failures on active connection any faster. If the remote process

RE: [ofa-general] Re: multicast join failed for...

2007-04-12 Thread Sean Hefty
The job will continue running though, and when you diagnose the problem and disconnect the bad node, rate will be back to high. So what's the problem? What would bring the rate back up? Halting all multicast traffic across the subnet to handle a flaky node wanting to join some multicast

[ofa-general] [RFC] [PATCH 1/3] 2.6.22 or 23 ib/sa: add registration for sa events

2007-04-19 Thread Sean Hefty
IB/sa: Add InformInfo/Notice support. From: Sean Hefty [EMAIL PROTECTED] Add SA client support for notice/trap registration using InformInfo. Clients can use the ib_sa interface to register for SA events based on trap numbers, and receive SA event notification. This allows clients to receive

[ofa-general] [RFC] [PATCH 3/3] 2.6.22 or 23 rdma/cm: check cache for path records

2007-04-19 Thread Sean Hefty
RDMA/cma: use local SA cache for path queries From: Sean Hefty [EMAIL PROTECTED] Have the rdma_cm check the local SA cache for path records before querying the remote SA. This improves path record lookup time and scale-out connection rates. Signed-off-by: Sean Hefty [EMAIL PROTECTED

Re: [ofa-general] [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache

2007-04-20 Thread Sean Hefty
Once SM is up on a node/switch whole network is up. Now is if some client is trying to establish a connection with other node, client is expected to resolve the path using sa API, I want to know how exactly it happens in the stack? See patch 3/3 for the use of the cache. In that patch, the

Re: [ofa-general] [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache

2007-04-23 Thread Sean Hefty
If some client calls cma_resolve_ib_route(), and let's assume that its local cache miss, and cma_query_ib_route() is called, this will send a SA query to the SM node CMIIW, Now on SM node I am not able to figure out that who will respond this GMP, and how requested attribute info is collected?

[ofa-general] RE: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache

2007-04-23 Thread Sean Hefty
A straight-forward approach would be to listen for port up/down events rather than or in addition to GID in/out, and do network discovery by DR SMPs. I'm not entirely following you. How would you listen for port up/down events? And are you suggesting that all nodes do network discovery using DR

[ofa-general] RE: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache

2007-04-23 Thread Sean Hefty
Isn't there a way to get notice for this? The closest trap I'm aware of is GID in/out of service. See 14.2.5.1 and 14.4.9. GID in/out of service is related to the existence of a path record between the SGID and DGID. If the path record parameters change, I'm not sure if the GID technically

RE: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache

2007-04-23 Thread Sean Hefty
Has anyone thought about using replication rather than caching to solve this problem? It seems to me it would be a lot faster for some single process in the network to fetch and keep a copy of the entire SA route database, format it into a binary format and use RC RDMA to transfer it to every node

RE: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add pathrecord cache

2007-04-23 Thread Sean Hefty
We could solve this by implementing a process running on the same node as the SA. And it's probably not too hard to add a way for opensm to spit out the table into an external file when it gets a signal or something. I agree that there are ways to solve this, but those solutions won't work with

RE: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: addpathrecord cache

2007-04-23 Thread Sean Hefty
Maybe, but there might be several other reasons for this. One might be that IPoIB is slower than link speeds, so e.g. miscalculating the rate still does not cause network failures. Another might be that people run TCP mostly, which is very good at recovering from failures, so if you get the LID

Re: [ofa-general] Re: [RFC] [PATCH 2/3] 2.6.22 or 23 ib/sa: add path record caching

2007-04-24 Thread Sean Hefty
+static struct miscdevice local_sa_misc = { + .minor = MISC_DYNAMIC_MINOR, + .name = "ib_local_sa", +}; I don't understand why you're registering a miscdevice etc. I don't see any implementation of a character device or indeed any userspace interface at all. So what's up here? The

Re: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V2] patch for review

2007-04-24 Thread Sean Hefty
If the ACK delays on both sides are not being taken into account properly when establishing a connection, then I guess that is a bug in our CM. I looked, and the cm does not take into account the ca ack delay. This can be worked around by bumping up the qp timeout value between calling

Re: [ofa-general] Re: IPOIB CM (NOSRQ)[PATCH V2] patch for review

2007-04-25 Thread Sean Hefty
What really should happen is that the field Local Ack Timeout in REQ should be (2 * PacketLifeTime + Local CA’s ACK delay) (see 12.7.34) and then the responder should use this for its QP. Just to clarify, the value is _based_ on (2 * PacketLifeTime + local CA ack delay). For example, if
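One way to read "based on" is that all three quantities are stored as exponents, so the sum has to be re-encoded rather than added directly. The sketch below assumes the 4.096 µs × 2^n encoding for both PacketLifeTime and the CA ack delay, and it rounds the sum up to the next encodable value. The rounding choice and the function name are assumptions for illustration, not the behaviour of the ib_cm.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode (2 * PacketLifeTime + CA ack delay) as a
 * 5-bit exponent. Inputs and result use units of 4.096 us * 2^n.
 */
uint8_t local_ack_timeout(uint8_t packet_life_time, uint8_t ca_ack_delay)
{
    assert(packet_life_time <= 31 && ca_ack_delay <= 31);

    /* Work in multiples of 4.096 us; doubling adds 1 to the exponent. */
    uint64_t total = (UINT64_C(1) << (packet_life_time + 1)) +
                     (UINT64_C(1) << ca_ack_delay);

    /* Smallest exponent whose time covers the total, capped at 31. */
    uint8_t exponent = 0;
    while (exponent < 31 && (UINT64_C(1) << exponent) < total)
        exponent++;
    return exponent;
}
```

With a PacketLifeTime of 14 and a CA ack delay of 15, both terms equal 2^15 units, so the result is 16. A much smaller ack delay of 10 still rounds up to 16, because 2^15 + 2^10 exceeds 2^15.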

[ofa-general] RE: hotplug event handle question

2007-04-25 Thread Sean Hefty
Should I move the QP to the error state in RDS or cma should handle this state too. Let me think about what to do here. I think the cma should perform this transition if it makes sense. - Sean

RE: [ofa-general] Re: [RFC] [PATCH 2/3] 2.6.22 or 23 ib/sa: add path record caching

2007-04-25 Thread Sean Hefty
That seems like an abuse of the miscdevice stuff, since you don't actually have a device. Why not just use module parameters? The only difference would be that the paths start /sys/module/ib_local_sa/parameters instead. Or if you really wanted to, I guess a sysctl would be appropriate. But I

RE: [ofa-general] Re: [RFC] [PATCH 2/3] 2.6.22 or 23 ib/sa: add path record caching

2007-04-25 Thread Sean Hefty
Sure... you'll have to implement your own set method but that's no different from putting attributes under your miscdevice. Just look at module_param_call() -- it's exactly what you want I think. Thanks - I'll update this.

[ofa-general] RE: [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache

2007-04-25 Thread Sean Hefty
We can get notice on port state changes though, can't we? Trap 128 (sent by switches) indicates that the link state of at least one port of the switch at LIDADDR has changed. I think it would be difficult for all nodes to determine which paths were affected. - Sean

[ofa-general] autotools question

2007-04-25 Thread Sean Hefty
Has anyone run into an issue with autotools not generating the .so extension to built library files, or know how to fix such an issue? - Sean

[ofa-general] RE: [Bug 581] rdma_get_src_port() not returning the correct port.

2007-04-25 Thread Sean Hefty
Can you give this a try? The source address was being overwritten by whatever the user passed into rdma_bind_addr. Signed-off-by: Sean Hefty [EMAIL PROTECTED] --- diff --git a/src/cma.c b/src/cma.c index c5f8cd9..fdadb69 100644 --- a/src/cma.c +++ b/src/cma.c @@ -509,12 +509,7 @@ int

RE: [ofa-general] hotplug event handle question

2007-04-26 Thread Sean Hefty
I think the problem is that cma_remove_id_dev overrides the current state, losing state information in the process. Why do we need CMA_DEVICE_REMOVAL at all? Everything seems to work fine just by forwarding RDMA_CM_EVENT_DEVICE_REMOVAL to user, without touching state. I need to read back over the

RE: [ofa-general] hotplug event handle question

2007-04-26 Thread Sean Hefty
At the very least we need to repeat the check: if (!cma_comp(id_priv, CMA_DESTROYING)) return 0; here to avoid calling the user after they've tried to destroy their id from another callback. See comment above. OK. Would that be enough? Off the top of my head, I don't

RE: [ofa-general] Re: [RFC] [PATCH 0/3] 2.6.22 or 23 ib:addpathrecordcache

2007-04-26 Thread Sean Hefty
Sure. But how that will interact with whatever extensions are going into 1.2.1 is hard for me to guess. I agree. My point is that the cache should be spec compliant now, with support for any potential extensions in 1.2.1 coming later. Even once 1.2.1 is released SAs will need time to

RE: [ofa-general] Fwd: Re: using stgit/guilt for public branches

2007-04-26 Thread Sean Hefty
FYI. I posted a question on git mailing list, asking about best ways to manage ofed repository. http://article.gmane.org/gmane.comp.version-control.git/45519 The conclusion so far seems to be that what we are doing (keeping patches under version control) is basically the right way to do it:

[ofa-general] bug in cma_iw_handler? (was hotplug event handle question)

2007-04-26 Thread Sean Hefty
Off the top of my head, I don't think so. Since the state is staying the same, we now have the potential of another thread invoking a callback to the same id. For example, the ib_cm could callback with a connect or reject event, which gets propagated to the user. The user will now see two

Re: [ofa-general] [RFC] [PATCH 0/3] 2.6.22 or 23 ib: add path record cache

2007-04-26 Thread Sean Hefty
I've updated these patches based on the feedback received so far: * Added definition for missing trap 259. * Replaced miscdevice usage with module parameters. * Add module parameter to control SA event registration. There is still one issue wrt the API that I'd like to get more opinions on.

Re: [ofa-general] APM Example

2007-04-26 Thread Sean Hefty
I was wondering whether there is an example of using APM with Openfabrics. I do not see an example in the examples directory. With the OFED 1.2 rc2 too, I have not seen such an example. I have a kernel ib_cm test program (cmpost) that I've used to test APM. It's available from:

[ofa-general] [PULL] ofed_1_2: branches for libibcm and librdmacm

2007-04-27 Thread Sean Hefty
Please pull: git://git.openfabrics.org/~shefty/libibcm.git ofed_1_2 and git://git.openfabrics.org/~shefty/librdmacm.git ofed_1_2 into OFED 1.2. This will pick up: * librdmacm: set source port after calling rdma_bind_addr. * rping: Transfer rkey/addr/len information in network

Re: [ofa-general] man pages for the rdma-cm

2007-05-03 Thread Sean Hefty
Are there man pages for the rdma-cm in the pipeline? I think it would be great (requirement?) to have these for ofed-1.2 since we do have the other verbs man pages. I didn't know if this was in-progress or are we looking for volunteers... I don't have man pages, but I did update the

RE: [ofa-general] man pages for the rdma-cm

2007-05-03 Thread Sean Hefty
Was just delayed to Monday. If you can do it today/tomorrow we may be able to integrate it I will try to complete the man pages for at least the APIs by tomorrow. I'm about 70% done writing them, but still need to tie them in with the build scripts. Steve, I will push the rping changes in with

[ofa-general] RE: man pages for the rdma-cm

2007-05-06 Thread Sean Hefty
Are there man pages for the rdma-cm in the pipeline? I think it would be great (requirement?) to have these for ofed-1.2 since we do have the other verbs man pages. I've added man pages for the APIs and test programs to my master and ofed_1_2 branches. If anyone gets a chance, I'd appreciate

Re: [ofa-general] RE: man pages for the rdma-cm

2007-05-07 Thread Sean Hefty
Here are a few comments. Consider them for inclusion, but what you've done so far is a great start. Thanks for the feedback. I'll try to update this before RC3 freezes. - rdma_disconnect - for iWARP connections, this initiates a RDMAC Verbs normal close. If the connection was properly

[ofa-general] [PATCH 1/3] rdma/cm: simplify device removal handling code

2007-05-07 Thread Sean Hefty
Add a new routine and rename another to encapsulate common code for synchronizing with device removal. Signed-off-by: Sean Hefty [EMAIL PROTECTED] --- drivers/infiniband/core/cma.c | 89 ++--- 1 files changed, 48 insertions(+), 41 deletions(-) diff --git

[ofa-general] [PATCH 2/3] rdma/cm: Fix synchronization with device removal in cma_iw_handler

2007-05-07 Thread Sean Hefty
, or a callback after they've destroyed the cm_id. Signed-off-by: Sean Hefty [EMAIL PROTECTED] --- drivers/infiniband/core/cma.c |5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index d026764..cfd57b4 100644 --- a/drivers

[ofa-general] [PATCH 3/3] rdma/cm: Add check to validate that cm_id is bound to a device

2007-05-07 Thread Sean Hefty
. This will allow a user to disconnect a cm_id or reject a connection after receiving a device removal event. Signed-off-by: Sean Hefty [EMAIL PROTECTED] --- drivers/infiniband/core/cma.c | 12 1 files changed, 8 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/core/cma.c b

[ofa-general] RE: man pages for the rdma-cm

2007-05-08 Thread Sean Hefty
I updated the man pages in my master branch and pushed the changes out. Details below. - are the events described anywhere? Maybe they should be described in rdma_get_cm_event? done - rdma_accept / rdma_connect: describe the conn_param fields. done - rdma_bind_addr: binding to port 0 will

[ofa-general] [GIT PULL] 2.6.22: please pull rdma-dev.git

2007-05-09 Thread Sean Hefty
Roland, please pull from: git://git.openfabrics.org/~shefty/rdma-dev.git for-roland This will cleanup device removal synchronization in the rdma_cm. The changes are based on 2.6.21. Sean Hefty (3): rdma/cm: simplify device removal handling code rdma/cm: Fix synchronization

Re: [ofa-general] Re: [OMPI devel] OMPI over ofed udapl - bugs opened

2007-05-09 Thread Sean Hefty
The reason it is hard or impossible to solve this in the DAPL layer is that any rdma operation on the QP affects the state of that QP and the associate CQs. In addition, if you use an RDMA send to enforce this you impact the other side by consuming a RECV buffer. So its hard if not impossible to

[ofa-general] RFC: location for IB CM statistics

2007-05-10 Thread Sean Hefty
I'd like to start adding some statistical information to the IB CM to help identify scalability or connectivity issues. Some example statistics that I would like to expose now are number of retried MADs, unmatched requests, total number of connections, etc. Longer term, additional statistics and

[ofa-general] RE: [Query] ib add path record cache

2007-05-14 Thread Sean Hefty
This can be treated as a facility similar to what we have in ARP table for TCP/IP. Secondly this will help in debugging of some new up-coming partially infiniband compliant hardware. But unless such a path actually exists to the remote node, I don't see that it's useful. And if such a path

Re: [ofa-general] ibv_modify_port?

2007-05-15 Thread Sean Hefty
I would like to propose a better interface. What if there were a generic DM agent in the kernel that provided an API for target devices (kernel and user) to register IOC's with it? It might look something like this: A generic DM makes sense. There are existing interfaces / implementations

Re: [ofa-general] [PATCH 1/2] OpenSM/ib_cm_types.h: Move CM definitions from ib_types.h

2007-05-16 Thread Sean Hefty
Hal Rosenstock wrote: OpenSM/ib_cm_types.h: Move CM definitions from ib_types.h CM types are defined in the libibcm library. Why not remove them completely from the opensm code? - Sean

Re: [ofa-general] Re: [Query] ib add path record cache

2007-05-16 Thread Sean Hefty
But initially this will generate a packet for each path, while sys admin knows that path is there and he can hard-code the entries for it. Other thing is that why Admin will care about creating such record while SA is itself taking care, right? In your original message you asked about adding

Re: [ofa-general] libibcm compatability problem

2007-05-17 Thread Sean Hefty
I'm using a 2.6.20.1 kernel with OFED 1.1. I get the following message when running my application: libibcm: Kernel ABi version 5 doesn't match library version 4. Could someone tell me what version of the library in terms of OFED release I should be using? I'm not sure if

Re: [ofa-general] libibcm compatability problem

2007-05-18 Thread Sean Hefty
Also I see that the function ib_cm_get_device has been removed. I was using this to monitor the file descriptor of the CM device. Could this function be put back into my local copy of libibcm or has this function been moved somewhere else in the code? The fd is exposed directly by walking

[ofa-general] [RFC] [PATCH 0/3] 2.6.23: basic support for IB routers

2007-05-18 Thread Sean Hefty
that were already pushed for 2.6.22. Signed-off-by: Sean Hefty [EMAIL PROTECTED]

[ofa-general] [RFC] [PATCH 2/3] 2.6.23: IB/cm: Modify passive side to use LIDs from LRH for routed connections

2007-05-18 Thread Sean Hefty
To support inter-subnet connections, the passive endpoint needs to use its subnet local LIDs. The LIDs carried in the REQ are currently the LIDs from the active subnet (SLID and router LID). Replace LIDs in the REQ with subnet local LIDs from LRH. Signed-off-by: Sean Hefty [EMAIL PROTECTED

Re: [ofa-general] IB/cm: bug in stale connection detection logic?

2007-05-21 Thread Sean Hefty
1. I see this in cm_match_req: timewait_info = cm_insert_remote_id(cm_id_priv->timewait_info); if (!timewait_info) timewait_info = cm_insert_remote_qpn(cm_id_priv->timewait_info); if (timewait_info) { cur_cm_id_priv =

Re: [ofa-general] Re: [PATCH for-2.6.22] IB/cm: improve local id allocation

2007-05-21 Thread Sean Hefty
good. Thanks Acked by: Sean Hefty [EMAIL PROTECTED]

Re: [ofa-general] IB/cm: bug in stale connection detection logic?

2007-05-21 Thread Sean Hefty
Could you please post a patch? Let's discuss whether it's appropriate for 2.6.22 separately. I mentioned 2.6.23 because it affects when I have to generate the patch. :) I will try to get to this tomorrow then. - Sean

[ofa-general] [PATCH] ib/cm: fix stale connection detection

2007-05-21 Thread Sean Hefty
The ib_cm can incorrectly detect a stale connection (a new connection request for a QPN that is already connected) as a duplicate connection request. Separate the handling of potential duplicate REQs from stale connections. Signed-off-by: Sean Hefty [EMAIL PROTECTED] --- Can you let me know

Re: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection

2007-05-22 Thread Sean Hefty
The patch looks obvious enough for 2.6.22, safe enough in that it replaces a timeout with a reject, and it addresses a real problem. Sean? Roland? What do you think? To make it easier, I've added the patch to: git://git.openfabrics.org/~shefty/rdma-dev.git for-roland commit

[ofa-general] RE: Problem with using two interfaces with rdma-cm

2007-05-23 Thread Sean Hefty
Rail 1 (ib0): 192.168.1.* Rail 2 (ib2): 192.168.3.* When I try to connect two QPs over these rails (one on each), the address resolutions for both QPs often return the context of just one of the rails, i.e. I am not able to use both rails. Is there anything I am missing here?

RE: [ofa-general] Re: [PATCH] ib/cm: fix stale connection detection

2007-05-23 Thread Sean Hefty
If two REQs are received with matching local IDs, but the REQs themselves differ on one or more fields, such as the QPN, the second REQ is dropped as a duplicate. Why do you speak about dropping duplicates as a valid response? I was only mentioning the current behavior. As far as I can tell,

[ofa-general] RE: Problem with using two interfaces with rdma-cm

2007-05-23 Thread Sean Hefty
I am able to ping on both the interfaces. The ping messages go over both the interfaces. In fact, after pinging the interfaces from each other a few times, I am able to connect properly over both the rails for some time. After a few minutes it falls back to just one interface. Odd - I will see if I

[ofa-general] [PATCH] for-2.6.23 ib/sa: use correct index for default pkey

2007-05-23 Thread Sean Hefty
MADs sent to the SA should use the index for the default pkey. There's no requirement that the default pkey be stored at index 0. Signed-off-by: Sean Hefty [EMAIL PROTECTED] --- Patch requires the latest changes to the pkey cache. This fix is not a priority, but it appears to be the only issue
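The point that the default pkey need not sit at index 0 can be sketched as a table search. This is a user-space illustration under stated assumptions, not the patch itself: the pkey table is modeled as a plain array, DEFAULT_PKEY and find_pkey_index are hypothetical names, and the comparison ignores the top (membership) bit so that both 0xFFFF and 0x7FFF count as the default partition.

```c
#include <stddef.h>
#include <stdint.h>

#define DEFAULT_PKEY 0xFFFF	/* default partition key, full membership */
#define PKEY_BASE_MASK 0x7FFF	/* low 15 bits identify the partition */

/*
 * Return the index of the first table entry in the same partition as
 * 'pkey', or -1 if the table holds no such entry. The membership bit
 * (bit 15) is ignored in the comparison.
 */
int find_pkey_index(const uint16_t *table, size_t len, uint16_t pkey)
{
	for (size_t i = 0; i < len; i++) {
		if ((table[i] & PKEY_BASE_MASK) == (pkey & PKEY_BASE_MASK))
			return (int)i;
	}
	return -1;
}
```

A caller would look up find_pkey_index(table, len, DEFAULT_PKEY) once and use the result when building MADs to the SA, instead of hard-coding index 0.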

Re: [ofa-general] RE: Problem with using two interfaces with rdma-cm

2007-05-23 Thread Sean Hefty
Steve Wise wrote: Guys, this reminds me of an issue we have with RNICs and regular NICs on the same physical network. By default Linux responds to ARP queries on all ports it receives the query on. This leads to very bad results when you're trying to do offloaded connections. When resolving

[ofa-general] [PATCH] ib/cm: optimize locking

2007-05-23 Thread Sean Hefty
The ib_cm is a little overzealous about using spin_lock_irqsave, when spin_lock_irq would do. Signed-off-by: Sean Hefty [EMAIL PROTECTED] --- This patch applies on top of ib/cm: fix stale connection detection. It has only been lightly tested using the librdmacm. Additional testing with ipoib cm

Re: [ofa-general] Re: [Query] ib add path record cache

2007-05-24 Thread Sean Hefty
Yes, it will, and hence reduce the initial SA traffic generated on a big cluster... Just imagine: the cluster is quite big and every node is trying to build its cache initially. It will create a large burst of SA packets. In general I agree with the notion of enhancing the cache to allow it to load

[ofa-general] [PATCH] 2.6.23 ib/cm: include HCA ACK delay in local ACK timeout

2007-05-25 Thread Sean Hefty
The ib_cm should include the HCA ACK delay when calculating the local ACK timeout value. If the HCA ACK delay is large enough relative to the packet life time, then the calculated timeout value is too small, which can result in connections timing out or excessive retries. Signed-off-by: Sean
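The arithmetic behind this can be sketched as follows. IB timeout fields are exponents, with time = 4.096 us * 2^value. The sketch assumes the local ACK timeout must cover a round trip (twice the packet life time) plus the HCA ACK delay, and picks the smallest exponent that covers that sum. The function names are hypothetical, and the rounding used in the actual patch may differ.

```c
#include <stdint.h>

#define IB_TIMEOUT_MAX 31	/* timeout fields are 5-bit exponents */

/* Convert an encoded timeout to nanoseconds: 4096 ns * 2^value. */
uint64_t ib_timeout_to_ns(uint8_t value)
{
	return 4096ULL << (value & 0x1F);
}

/*
 * Hypothetical helper: smallest encoded timeout whose duration is at
 * least 2 * packet life time + HCA ACK delay, capped at IB_TIMEOUT_MAX.
 */
uint8_t local_ack_timeout(uint8_t ca_ack_delay, uint8_t packet_life_time)
{
	uint64_t need = 2 * ib_timeout_to_ns(packet_life_time) +
			ib_timeout_to_ns(ca_ack_delay);
	uint8_t value = 0;

	while (value < IB_TIMEOUT_MAX && ib_timeout_to_ns(value) < need)
		value++;
	return value;
}
```

With a packet life time of 14 and an ACK delay of 20, ignoring the delay would give 15, while the delay alone already needs 20; this sketch returns 21, which shows how a large ACK delay dominates the result.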

RE: [ofa-general] Re: [Query] ib add path record cache

2007-05-29 Thread Sean Hefty
OK, but can we keep the framework ready by that time? I plan on re-submitting the cache for 2.6.23. Beyond that I won't have the time to work on enhancements for a few weeks. I will happily review any patch submissions though. How will this be managed? This will add extra startup time in the

RE: [ofa-general] Re: [Query] ib add path record cache

2007-05-30 Thread Sean Hefty
OK, I will soon post a patch related to this. How will the static PR file be generated? Needs to be discussed. Please look at my latest changes to the local SA when generating the patches. git://git.openfabrics.org/~shefty/rdma-dev.git sa_cache I'm not sure about the best way to communicate PRs

[ofa-general] [RFC] [PATCH 0/2] for 2.6.23: ib/sa - add local path record caching

2007-05-30 Thread Sean Hefty
pushed these changes to: git://git.openfabrics.org/~shefty/rdma-dev.git sa_cache I would like to close any open issues with this approach in time to pull it into 2.6.23. Signed-off-by: Sean Hefty [EMAIL PROTECTED]

[ofa-general] [RFC] [PATCH 1/2] for 2.6.23: ib/sa - add InformInfo/Notice support

2007-05-30 Thread Sean Hefty
Add SA client support for notice/trap registration using InformInfo. Clients can use the ib_sa interface to register for SA events based on trap numbers, and receive SA event notification. This allows clients to receive notification, such as GID in/out of service. Signed-off-by: Sean Hefty

Re: [ofa-general] Re: [RFC] [PATCH 2/2] for 2.6.23: ib/sa - add local path record caching

2007-05-31 Thread Sean Hefty
Michael S. Tsirkin wrote: It seems that below you try to get 0x7F paths to each dest: This is the maximum number that a PR can request. Note that you only get that many if that many exist. I would expect most subnets to only have a couple of paths between each destination. But here you

Re: [ofa-general] Re: [Query] ib add path record cache

2007-05-31 Thread Sean Hefty
Do you have some pointer/doc related to the design of the current SA_CACHE module? It will make things faster to understand. If not, then I will require your support to understand things, though I have some top-level view. I don't have any design docs. But I will happily answer any
