RE: [ofa-general] How to establish IB communication more effectively?

2009-05-12 Thread Sean Hefty
>Yes, of course. But to start with, let's analyze the case of each node
>running --one-- rank and then take it from there to the case where
>each node runs C ranks.

The caching is independent of running MPI though.  To get a fair comparison,
you'd probably have to reboot the entire cluster before running the test and
ensure that no other communication between the nodes occurs over ipoib.

For myself, I'm not sure that the tests are the same.  The DAPL providers create
and modify the QPs differently.  I'd need to walk through the code to see
whether QP creation time is included and verify that the QP modify calls are the
same.

As for responding to the initial question, using sockets with hard-coded values
seems to be the most common way to establish IB connections at scale, though I
would guess that using the ib_cm with hard-coded values would work about the
same.
 
- Sean

___
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [ofa-general] How to establish IB communication more effectively?

2009-05-12 Thread Or Gerlitz
Davis, Arlin R  wrote:
>>Just to make sure we're on the same page: both IPoIB and the RDMA-CM
>>use SA path queries (ipoib for the unicast arp reply, and rdma-cm for
>>rdma_resolve_route), going into details, things look like:

> I am running IPoIB connected so I assume there is no path query
> and I see no difference in IPoIB unconnected mode so I also assume
> it caches path records during ARP processing. Can someone confirm?

Arlin,

Both datagram and connected mode issue a path query (it's the way IB
works). Datagram mode uses the IB UD (Unreliable Datagram) transport,
and once the path is resolved it creates an IB AH (Address Handle)
which is used in conjunction with the UD QP. Connected mode uses the
IB RC (Reliable Connection) transport, so the path info is used to
establish its connection through the IB CM.

> ARP cache is also hit in all these cases so you can take ARP
> request/reply out.

I'm not with you: by "ARP cache" I assume you refer to the networking
stack neighbour table, correct? So this cache already has the entries
because the IPoIB network was also used to spawn the job?

> However, with rdma_cm we actually have to pick up the ADDR_RESOLVED (arp)
> event before moving on to the rdma_resolve_route (path record), and then
> wait for ROUTE_RESOLVED event before moving on to the rdma_connect call,
> and then finally wait for ESTABLISHED. You start to get the picture of
> where my time goes? Not only do we have path record query delays, we have
> a 3-step event processing (waiting/waking on each) just to get connected.

Yes, this sounds like a potentially big difference from the TCP case;
let's see how many kernel --> user events we have in both methods --

rdma-cm active side
---
addr-resolved
route-resolved
established

rdma-cm passive side
--
connection-request
established

scm active side
--
connected

scm passive side
--
connection request
connected

In the rdma-cm framework there are three kernel --> user
transitions/events for the active side and two for the passive,
whereas in the scm framework there are two for the passive but only
one for the active. Also counting user --> kernel transitions, the
rdma-cm active side has three vs. only one in the scm. This sounds
like where things would probably make a difference. I believe it could
be fairly easy to have the kernel rdma ucm module do two successive
calls (route resolve and connect) once the local address is resolved,
since at that point the user space consumer can create their QP, etc.

> Not only do we have path record query delays

So we agree that it's path query --delays-- and for one rank per node
it's the same # of path queries? (Sean)

Or.


Re: [ofa-general] How to establish IB communication more effectively?

2009-05-12 Thread Or Gerlitz
>>Just to make sure we're on the same page: both IPoIB and the RDMA-CM
>>use SA path queries

> But ipoib caches its path records...

Yes, of course. But to start with, let's analyze the case of each node
running --one-- rank and then take it from there to the case where
each node runs C ranks.

Or.


RE: [ofa-general] How to establish IB communication more effectively?

2009-05-12 Thread Davis, Arlin R
 
>Davis, Arlin R  wrote:
>> For a connection (socket connect, exchanging QP info, private data,
>> qp modify) using uDAPL socket cm versus rdma_cm I get:
>> socket_cm on 1Ge == ~900us
>> socket_cm on IPoIB (mlx4 ddr) == ~400us
>> rdma_cm on IB (mlx4 ddr) == ~2200us
>> As you can see, the path record queries via rdma_cm add a substantial
>> penalty.
>
>Hi Arlin,
>
>Just to make sure we're on the same page: both IPoIB and the RDMA-CM
>use SA path queries (ipoib for the unicast arp reply, and rdma-cm for
>rdma_resolve_route), going into details, things look like:

I am running IPoIB connected mode so I assume there is no path query,
and I see no difference in IPoIB unconnected mode, so I also assume
it caches path records during ARP processing. Can someone confirm?

ARP cache is also hit in all these cases so you can take 
ARP request/reply out. However, with rdma_cm we actually 
have to pick up the RDMA_CM_EVENT_ADDR_RESOLVED (arp) event 
before moving on to the rdma_resolve_route (path record), 
and then wait for RDMA_CM_EVENT_ROUTE_RESOLVED event 
before moving on to the rdma_connect call, and then 
finally wait for RDMA_CM_EVENT_ESTABLISHED. You start
to get the picture of where my time goes? Not only do 
we have path record query delays we have a 3 step event 
processing (waiting/waking on each) just to get connected.

My measurements are on top of uDAPL so everything is equal.
I simply added some timers to dtest around connect and 
wait for connection event:

start_timer
dat_ep_connect()
dat_evd_wait()
stop_timer

For example (client side):

eth0 socket_cm:  dtest -P ofa-v2-mlx4_0-1 -h cst-55-eth0 -t 
IPoIB socket_cm: dtest -P ofa-v2-mlx4_0-1 -h cst-55-ib0 -t
rdma_cm: dtest -P ofa-v2-ib0 -h cst-55-ib0 -t


-arlin


RE: [ofa-general] How to establish IB communication more effectively?

2009-05-12 Thread Sean Hefty
>Just to make sure we're on the same page: both IPoIB and the RDMA-CM
>use SA path queries

But ipoib caches its path records...

- Sean



Re: [ofa-general] How to establish IB communication more effectively?

2009-05-12 Thread Or Gerlitz
Davis, Arlin R  wrote:
> For a connection (socket connect, exchanging QP info, private data, qp modify)
> using uDAPL socket cm versus rdma_cm I get:
> socket_cm on 1Ge == ~900us
> socket_cm on IPoIB (mlx4 ddr) == ~400us
> rdma_cm on IB (mlx4 ddr) == ~2200us
> As you can see, the path record queries via rdma_cm add a substantial penalty.

Hi Arlin,

Just to make sure we're on the same page: both IPoIB and the RDMA-CM
use SA path queries (ipoib for the unicast arp reply, and rdma-cm for
rdma_resolve_route), going into details, things look like:

with the rdma-cm:

rdma_resolve_addr
  A --> *  ARP request (broadcast)
  B --> A ARP reply (unicast, before that B does SA path query)
rdma_resolve_route
  A does SA path query
rdma_connect
  A --> B CM REQ
  B --> A CM REP
  A --> B CM RTU

with the socket cm / ipoib:

socket connect
  A --> *  ARP request (broadcast)
  B --> A ARP reply (unicast, before that B does SA path query)
  A --> B TCP SYN (unicast, A does SA path query!)
  B --> A TCP SYN + ACK
  A --> B TCP ACK

Looking at the differences between the flows, we can see that --both--
flows have --two-- path queries, so the 400us vs 2200us difference
can't be related to that. So, is it possible that you have counted
rdma_create_qp in the rdma-cm accounting and didn't count
ibv_create_qp in the scm accounting?

Or.


RE: [ofa-general] How to establish IB communication more effectively?

2009-05-12 Thread Davis, Arlin R

>Hi all,
>I'm using libibverbs to build a cluster memory pool, and using TCP/IP
>handshake to exchange memory information and establish the connection
>before the IB communication. I found this process cost a lot of time,
>100ms on a 1GbE LAN, so I want to use rdma_cm or ib_ucm to handle the
>establishment. But I don't find sample code or API documentation; is
>there anything I missed?
>BTW, how is communication established in current OFED? Any comparison
>or suggestion is appreciated; that will help me a lot.
>

What scale are you targeting?

Your single connection number seems high. For a connection
(socket connect, exchanging QP info, private data, qp modify)
using uDAPL socket cm versus rdma_cm I get:

socket_cm on 1Ge == ~900us
socket_cm on IPoIB (mlx4 ddr) == ~400us
rdma_cm on IB (mlx4 ddr) == ~2200us

As you can see, the path record queries via rdma_cm add 
a substantial penalty. With larger scale clusters this
really starts to hurt.

You can look at uDAPL (dapl/openib_cma and dapl/openib_scm) 
source for examples of a socket cm implementation vs rdma_cm. 
With the socket cm version we ran up to 14,400 cores with 
no problems using Intel MPI. However, with rdma_cm we 
had problems reaching 1000 cores due to IPoIB ARP storms and
SA path record query issues. If someone would step up and 
provide a scalable SA caching solution in OFED then rdma_cm 
could possibly work for us again. Any takers? :^)

-arlin
