RE: [ofa-general] How to establish IB communication more effectively?
> Yes, of course. But, to start with, let's analyze the case of each node
> running --one-- rank and then take it from there to the case where
> each node runs C ranks.

The caching is independent of running MPI, though. To get a fair comparison, you'd probably have to reboot the entire cluster before running the test and ensure that no other communication between the nodes occurs over IPoIB.

For myself, I'm not sure that the tests are the same. The DAPL providers create and modify the QPs differently. I'd need to walk through the code to see whether QP creation time is included and verify that the QP modify calls are the same.

As for the initial question, using sockets with hard-coded values seems to be the most common way to establish IB connections at scale, though I would guess that using the ib_cm with hard-coded values would work about the same.

- Sean

___
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] How to establish IB communication more effectively?
Davis, Arlin R wrote:
>> Just to make sure we're on the same page: both IPoIB and the RDMA-CM
>> use SA path queries (ipoib for the unicast arp reply, and rdma-cm for
>> rdma_resolve_route), going into details, things look like:
> I am running IPoIB connected so I assume there is no path query,
> and I see no difference in IPoIB unconnected mode, so I also assume
> it caches path records during ARP processing. Can someone confirm?

Arlin,

Both datagram and connected mode issue a path query (it's the way IB works). Datagram mode uses the IB UD (Unreliable Datagram) transport; once the path is resolved, it creates an IB AH (Address Handle) which is used in conjunction with the UD QP. Connected mode uses the IB RC (Reliable Connection) transport, so the path info is used to establish its connection through the IB CM.

> ARP cache is also hit in all these cases so you can take ARP request/reply
> out.

I am not with you: by "ARP cache" I assume you refer to the networking stack neighbour table, correct? So this cache has the entries because the IPoIB network was also used to spawn the job?

> However, with rdma_cm we actually have to pick up the ADDR_RESOLVED (arp)
> event before moving on to the rdma_resolve_route (path record), and then
> wait for ROUTE_RESOLVED event before moving on to the rdma_connect call,
> and then finally wait for ESTABLISHED. You start to get the picture of
> where my time goes? Not only do we have path record query delays, we have
> a 3-step event processing (waiting/waking on each) just to get connected.
Yes, this sounds like a potentially big difference from the TCP case. Let's see how many kernel --> user events we have in both methods:

rdma-cm active side:
  addr-resolved
  route-resolved
  established

rdma-cm passive side:
  connection-request
  established

scm active side:
  connected

scm passive side:
  connection-request
  connected

In the rdma-cm framework there are three kernel --> user transitions/events for the active side and two for the passive, whereas in the scm framework there are two for the passive but only one for the active. Also counting user --> kernel transitions, on the rdma-cm active side there are three vs only one in the scm. This sounds like where things would probably make a difference.

I believe it could be fairly easy to have the kernel rdma ucm module do two successive calls (route resolve and connect) once the local address is resolved, since at that point the user space consumer can create their QP, etc.

> Not only do we have path record query delays

So we agree that it's path query --delays--, and that for one rank per node it's the same # of path queries? (Sean)

Or.
Re: [ofa-general] How to establish IB communication more effectively?
>> Just to make sure we're on the same page: both IPoIB and the RDMA-CM
>> use SA path queries
> But ipoib caches its path records...

Yes, of course. But, to start with, let's analyze the case of each node running --one-- rank and then take it from there to the case where each node runs C ranks.

Or.
RE: [ofa-general] How to establish IB communication more effectively?
> Davis, Arlin R wrote:
>> For a connection (socket connect, exchanging QP info, private data, qp modify)
>> using uDAPL socket cm versus rdma_cm I get:
>> socket_cm on 1Ge == ~900us
>> socket_cm on IPoIB (mlx4 ddr) == ~400us
>> rdma_cm on IB (mlx4 ddr) == ~2200us
>> As you can see, the path record queries via rdma_cm add a substantial penalty.
>
> Hi Arlin,
>
> Just to make sure we're on the same page: both IPoIB and the RDMA-CM
> use SA path queries (ipoib for the unicast arp reply, and rdma-cm for
> rdma_resolve_route), going into details, things look like:

I am running IPoIB connected, so I assume there is no path query, and I see no difference in IPoIB unconnected mode, so I also assume it caches path records during ARP processing. Can someone confirm?

The ARP cache is also hit in all these cases, so you can take the ARP request/reply out.

However, with rdma_cm we actually have to pick up the RDMA_CM_EVENT_ADDR_RESOLVED (arp) event before moving on to rdma_resolve_route (path record), then wait for the RDMA_CM_EVENT_ROUTE_RESOLVED event before moving on to the rdma_connect call, and then finally wait for RDMA_CM_EVENT_ESTABLISHED. You start to get the picture of where my time goes? Not only do we have path record query delays, we have a 3-step event processing (waiting/waking on each) just to get connected.

My measurements are on top of uDAPL, so everything is equal. I simply added some timers to dtest around connect and the wait for the connection event:

  start_timer
  dat_ep_connect()
  dat_evd_wait()
  stop_timer

For example (client side):

  eth0 socket_cm:  dtest -P ofa-v2-mlx4_0-1 -h cst-55-eth0 -t
  IPoIB socket_cm: dtest -P ofa-v2-mlx4_0-1 -h cst-55-ib0 -t
  rdma_cm:         dtest -P ofa-v2-ib0 -h cst-55-ib0 -t

-arlin
RE: [ofa-general] How to establish IB communication more effectively?
> Just to make sure we're on the same page: both IPoIB and the RDMA-CM
> use SA path queries

But ipoib caches its path records...

- Sean
Re: [ofa-general] How to establish IB communication more effectively?
Davis, Arlin R wrote:
> For a connection (socket connect, exchanging QP info, private data, qp modify)
> using uDAPL socket cm versus rdma_cm I get:
> socket_cm on 1Ge == ~900us
> socket_cm on IPoIB (mlx4 ddr) == ~400us
> rdma_cm on IB (mlx4 ddr) == ~2200us
> As you can see, the path record queries via rdma_cm add a substantial penalty.

Hi Arlin,

Just to make sure we're on the same page: both IPoIB and the RDMA-CM use SA path queries (ipoib for the unicast arp reply, and rdma-cm for rdma_resolve_route). Going into details, things look like:

with the rdma-cm:

  rdma_resolve_addr
    A --> * ARP request (broadcast)
    B --> A ARP reply (unicast, before that B does SA path query)
  rdma_resolve_route
    A does SA path query
  rdma_connect
    A --> B CM REQ
    B --> A CM REP
    A --> B CM RTU

with the socket cm / ipoib:

  socket connect
    A --> * ARP request (broadcast)
    B --> A ARP reply (unicast, before that B does SA path query)
    A --> B TCP SYN (unicast, A does SA path query!)
    B --> A TCP SYN + ACK
    A --> B TCP ACK

Looking at the differences between the flows, we can see that --both-- flows have --two-- path queries, so the 400us vs 2200us difference can't be related to that. So, is it possible that you have counted rdma_create_qp in the rdma-cm accounting and didn't count ibv_create_qp in the scm accounting?

Or.
RE: [ofa-general] How to establish IB communication more effectively?
> Hi all,
> I'm using libibverbs to build a cluster memory pool, and using TCP/IP
> handshake to exchange memory information and establish the connection
> before the IB communication. While I found this process cost a lot
> of time, 100ms in 1GEth LAN, so I want to use the rdma_cm or ib_ucm to
> handle the establishment. But I don't find sample code or API
> document, is there anything I missed?
> BTW, how to establish communication in current OFED? Any comparison
> or suggestion is appreciated, that will help me a lot.

What scale are you targeting? Your single connection number seems high.

For a connection (socket connect, exchanging QP info, private data, qp modify) using uDAPL socket cm versus rdma_cm I get:

  socket_cm on 1Ge == ~900us
  socket_cm on IPoIB (mlx4 ddr) == ~400us
  rdma_cm on IB (mlx4 ddr) == ~2200us

As you can see, the path record queries via rdma_cm add a substantial penalty. With larger scale clusters this really starts to hurt.

You can look at the uDAPL (dapl/openib_cma and dapl/openib_scm) source for examples of a socket cm implementation vs rdma_cm. With the socket cm version we ran up to 14,400 cores with no problems using Intel MPI. However, with rdma_cm we had problems reaching 1000 cores due to IPoIB ARP storms and SA path record query issues.

If someone would step up and provide a scalable SA caching solution in OFED then rdma_cm could possibly work for us again. Any takers? :^)

-arlin
