On Dec 6, 2007, at 9:54 AM, Durga Choudhury wrote:

Automatically striping large messages across multiple NICs is certainly a very nice feature; I was not aware that OpenMPI does this transparently. (I wonder if other MPI implementations do this or not). However, I have the following concern: Since the communication over an ethernet NIC is most likely over IP, does it take into account the route cost when striping messages? For example, host A and B in the MPD ring might be connected via two NICs, one direct and one via an intermediate router, or one with a large bandwidth and another with a small bandwidth. Does OpenMPI send a smaller chunk of data over a route with a higher cost?

Not unless you tell it.

In IB networks, the network API exposes the bandwidth differences between NICs, and Open MPI takes that into account when deciding how much data to send down each endpoint. Open MPI does not currently know anything about, or try to optimize based on, the costs of different routes.
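To make the idea of bandwidth-weighted striping concrete, here is a minimal sketch in Python. It is not Open MPI's actual code; the function name `stripe_by_bandwidth` and the policy of giving leftover bytes to the fastest link are assumptions made purely for illustration.

```python
def stripe_by_bandwidth(message_len, bandwidths):
    """Split message_len bytes across links in proportion to their bandwidth.

    Hypothetical helper for illustration only; not an Open MPI API.
    bandwidths is a list of positive integers in any common unit (e.g. Mbps).
    """
    if message_len < 0 or not bandwidths or any(bw <= 0 for bw in bandwidths):
        raise ValueError("need a non-negative length and positive bandwidths")

    total = sum(bandwidths)
    # Integer share for each link, rounded down.
    chunks = [message_len * bw // total for bw in bandwidths]

    # Rounding down can leave a few bytes unassigned; give them to the
    # fastest link (an arbitrary choice for this sketch).
    remainder = message_len - sum(chunks)
    fastest = max(range(len(bandwidths)), key=lambda i: bandwidths[i])
    chunks[fastest] += remainder
    return chunks


if __name__ == "__main__":
    # A 1 Gbps link and a 100 Mbps link sharing a 1100-byte message.
    print(stripe_by_bandwidth(1100, [1000, 100]))
```

Note that this weights only by bandwidth. Route cost (hop count, intermediate routers) does not enter the calculation, which matches the behavior described above.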

On a TCP network, whether you go through 2 or 3 switches -- does it really matter? The latency is already so high that adding another switch (or 2 or 3 or ...) may not make much of a difference anyway. Raw bandwidth differences between two networks will make a difference, but the number of hops -- as long as the hop counts are not *too* different -- might not.

Also consider: if you're combining 100Mbps and 1Gbps ethernet networks -- is it really worth it? If your goal is simple bandwidth addition, note that you're adding at most about 10% to the capacity of the 1Gbps network, at the cost of additional complexity in your software and/or fragmentation and reassembly penalties. Will you really see more delivered bandwidth? It probably depends on your application (e.g., are you continually sending very large messages?). You might get much more bang for your buck if you combine like networks (e.g., 2x100Mbps or 2x1Gbps) because you'll be [potentially] doubling your bandwidth.
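The arithmetic behind that comparison can be sketched as follows. This is a best-case estimate that assumes perfect striping with no software or reassembly overhead; the helper name `ideal_speedup` is hypothetical, and real delivered bandwidth will be lower.

```python
def ideal_speedup(bandwidths):
    """Best-case speedup from striping over all links versus using only
    the fastest one. Ignores all overhead (an assumption of this sketch)."""
    if not bandwidths or any(bw <= 0 for bw in bandwidths):
        raise ValueError("need at least one positive bandwidth")
    return sum(bandwidths) / max(bandwidths)


if __name__ == "__main__":
    # Bandwidths in Mbps.
    for links in ([1000, 100], [100, 100], [1000, 1000]):
        print(links, round(ideal_speedup(links), 2))
```

Under these assumptions, mixing 1Gbps with 100Mbps tops out at a 1.1x gain, while pairing two like networks tops out at 2x.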

Because of this concern, I think the channel bonding approach someone else suggested is preferable; all these details will be taken care of at the hardware level instead of at the IP level.

That's not quite true. Both approaches are handled in software; one is in the kernel, the other is in the middleware. The hardware is unaware that you are striping large messages.

--
Jeff Squyres
Cisco Systems
