On Dec 6, 2007, at 9:54 AM, Durga Choudhury wrote:
Automatically striping large messages across multiple NICs is
certainly a very nice feature; I was not aware that OpenMPI does
this transparently. (I wonder if other MPI implementations do this
or not). However, I have the following concern: Since the
communication over an ethernet NIC is most likely over IP, does it
take into account the route cost when striping messages? For
example, host A and B in the MPD ring might be connected via two
NICs, one direct and one via an intermediate router, or one with a
large bandwidth and another with a small bandwidth. Does OpenMPI
send a smaller chunk of data over a route with a higher cost?
Not unless you tell it.
In IB networks, the network API exposes the bandwidth of each NIC, and
Open MPI takes that into account when deciding how much data to send
down each endpoint. Open MPI does not currently know anything about,
or try to optimize for, the costs of different routes.
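To illustrate the idea of weighting by bandwidth, here is a minimal
sketch of splitting one large message across links in proportion to
their advertised bandwidth. This is not Open MPI's actual code; the
function name split_by_bandwidth and its integer-Mbps inputs are
assumptions made for the example.

```python
def split_by_bandwidth(message_bytes, bandwidths_mbps):
    """Return per-link chunk sizes (bytes) proportional to link bandwidth.

    Hypothetical helper for illustration only; bandwidths are assumed
    to be positive integers (e.g. in Mbps).
    """
    if message_bytes < 0:
        raise ValueError("message_bytes must be non-negative")
    if not bandwidths_mbps or any(bw <= 0 for bw in bandwidths_mbps):
        raise ValueError("need at least one link, all with positive bandwidth")

    total_bw = sum(bandwidths_mbps)
    # Integer share for each link, rounded down.
    chunks = [message_bytes * bw // total_bw for bw in bandwidths_mbps]

    # Rounding down can leave a few bytes unassigned; hand them to the
    # fastest link so the chunks always add up to the full message.
    leftover = message_bytes - sum(chunks)
    fastest = max(range(len(bandwidths_mbps)), key=lambda i: bandwidths_mbps[i])
    chunks[fastest] += leftover
    return chunks


if __name__ == "__main__":
    # A 1 Gbps link and a 100 Mbps link sharing a 1100-byte message.
    print(split_by_bandwidth(1100, [1000, 100]))
```

Note that this weights by raw bandwidth only; it has no notion of hop
count or route cost, which matches the limitation described above.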
On a TCP network, whether you go through 2 or 3 switches -- does it
really matter? The latency is so high that adding another switch (or
2 or 3 or ...) may not make much of a difference anyway. Raw
bandwidth differences between two networks will make a difference, but
number of hops -- as long as they're not *too* different -- might not.
Also consider: if you're combining 100Mbps and 1Gbps ethernet networks
-- is it really worth it? If your goal is simple bandwidth addition,
note that you're adding only a small fraction (about 10%) of
capability to the 1Gbps network at the cost of additional complexity
in your software and/or fragment reassembly penalties. Will you really see more
delivered bandwidth? It's probably dependent upon your application
(e.g., are you continually sending very large messages?). You might
get much more bang for your buck if you combine like networks (e.g.,
2x100Mbps or 2x1Gbps) because you'll be [potentially] doubling your
bandwidth.
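The arithmetic behind that comparison can be sketched as follows. This
is a best-case estimate only, assuming perfect striping with no
software or reassembly overhead; the function name ideal_speedup is
made up for the example.

```python
def ideal_speedup(bandwidths_mbps):
    """Best-case speedup of striping over using only the fastest link.

    Hypothetical helper: assumes perfect striping with zero overhead,
    so real delivered bandwidth will be lower.
    """
    if not bandwidths_mbps or any(bw <= 0 for bw in bandwidths_mbps):
        raise ValueError("need at least one link, all with positive bandwidth")
    return sum(bandwidths_mbps) / max(bandwidths_mbps)


if __name__ == "__main__":
    for links in ([1000, 100], [100, 100], [1000, 1000]):
        print(links, "->", round(ideal_speedup(links), 2), "x")
```

Under these assumptions, mixing 1Gbps with 100Mbps tops out at about
1.1x the single 1Gbps link, while two like networks top out at 2x.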
Because of this concern, I think the channel bonding approach
someone else suggested is more preferable; all these details will be
taken care of at the hardware level instead of at the IP level.
That's not quite true. Both approaches are handled in software; one
is in the kernel, the other is in the middleware. The hardware is
unaware that you are striping large messages.
--
Jeff Squyres
Cisco Systems