Re: [OMPI users] MPI_Bcast issue

Jeff Squyres Wed, 11 Aug 2010 10:11:22 -0400

On Aug 11, 2010, at 12:10 AM, Randolph Pullen wrote:

> Sure, but broadcasts are faster - less reliable apparently, but much faster 
> for large clusters.


Just to be totally clear: MPI_BCAST is defined to be "reliable", in the sense 
that it will complete or invoke an error (vs. unreliable data streams like UDP, 
where a packet that is sent may or may not arrive at the receiver).  

I think you're saying that something in your setup does not appear to be 
functioning properly -- possibly an OMPI bug, possibly TCP timeouts, possibly 
incorrect use of MPI, possibly ...etc.  But I just wanted to disambiguate the 
meaning of the word "reliable" here.

> Jeff says that all OpenMPI calls are implemented with point to point B-tree 
> style communications of log N transmissions

Just to clarify so that I'm not mis-quoted, I said: "All of Open MPI's 
network-based collectives use point-to-point communications underneath (shared 
memory may not, but that's not the issue here)".  

1. "Collectives" means a very different thing than "all Open MPI calls".
2. Some of our algorithms are not based on binary (or binomial -- it's not 
clear what you meant) trees.
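
To make the tree terminology concrete, here is a minimal sketch of a textbook
binomial-tree broadcast built only from point-to-point sends.  It is a plain
Python simulation, not Open MPI code and not necessarily the algorithm Open MPI
picks for any given case; the function names (binomial_bcast_schedule,
bcast_stats) are made up for illustration.  The point is that such a tree takes
about log2(N) *rounds*, but still N-1 individual point-to-point messages.

```python
def binomial_bcast_schedule(n, root=0):
    """Return the rounds of a binomial-tree broadcast over n ranks.

    Each round is a list of (sender, receiver) pairs.  Ranks are shifted
    so that the root acts as virtual rank 0.
    """
    rounds = []
    mask = 1
    while mask < n:
        sends = []
        # Virtual ranks 0..mask-1 already hold the data at this point;
        # each one forwards it to the virtual rank `mask` above itself.
        for vrank in range(mask):
            peer = vrank + mask
            if peer < n:
                sends.append(((vrank + root) % n, (peer + root) % n))
        rounds.append(sends)
        mask <<= 1
    return rounds


def bcast_stats(n, root=0):
    """Simulate the schedule; return (rounds, messages, everyone_has_data)."""
    has_data = {root}
    messages = 0
    schedule = binomial_bcast_schedule(n, root)
    for sends in schedule:
        for src, dst in sends:
            # A rank may only send data it has already received.
            assert src in has_data
            has_data.add(dst)
            messages += 1
    return len(schedule), messages, len(has_data) == n


if __name__ == "__main__":
    for n in (1, 2, 5, 8, 1000):
        print(n, bcast_stats(n))
```

For 8 ranks this gives 3 rounds and 7 messages; for 5 ranks it is still 3
rounds but only 4 messages.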

Sorry to be so pedantic -- but mis-quotes like this have been the source of 
huge misunderstandings in the past.

It is also worth noting that Open MPI's collectives are implemented with 
plugins -- there's nothing preventing a new plugin that does *not* use 
point-to-point communication calls (like the shared memory collective 
implementations, or multicast, or some other kind of hardware collective 
offload, or ...).

Indeed, I should point out that my statement was not entirely correct because 
Voltaire just recently committed the "fca" plugin to the OMPI development trunk 
(to be introduced in OMPI v1.5) that uses IB hardware offloading for MPI 
collective implementations -- see their press releases and marketing material 
for how this stuff works.  Mellanox has slightly different MPI collective IB 
hardware offloading technology for Open MPI, too.

> So I guess that alltoall would be N log N

I'm not sure of the complexity of OMPI's alltoall algorithms offhand.  I see at 
least 3 algorithms after a *quick* look in the OMPI source code.  They probably 
all have their own complexities, but need to be viewed in the context of when 
those algorithms allow themselves to be used (e.g., O(N) may not matter if 
there's a small number of peers with small messages).
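
As a rough illustration of why a single complexity figure is hard to give, here
is a sketch of one textbook alltoall schedule, the pairwise exchange.  It is a
plain Python simulation under my own assumptions, not taken from the OMPI
source, and the name pairwise_alltoall is hypothetical.  With N ranks it takes
N-1 rounds and N*(N-1) point-to-point messages in total; other schedules trade
fewer rounds for larger messages, which is why message size and peer count
matter when choosing between them.

```python
def pairwise_alltoall(send):
    """Simulate a pairwise-exchange alltoall.

    send[r][d] is the block that rank r wants delivered to rank d.
    Returns (recv, rounds, messages), where recv[d][r] is the block
    that rank d received from rank r.
    """
    n = len(send)
    recv = [[None] * n for _ in range(n)]

    # Each rank keeps its own block locally; no message is needed.
    for r in range(n):
        recv[r][r] = send[r][r]

    messages = 0
    # In round `step`, rank r sends to (r + step) % n and, symmetrically,
    # receives from (r - step) % n.
    for step in range(1, n):
        for r in range(n):
            dst = (r + step) % n
            recv[dst][r] = send[r][dst]
            messages += 1

    rounds = max(n - 1, 0)
    return recv, rounds, messages


if __name__ == "__main__":
    n = 4
    blocks = [[(src, dst) for dst in range(n)] for src in range(n)]
    result, rounds, messages = pairwise_alltoall(blocks)
    print("rounds:", rounds, "messages:", messages)
```

With 4 ranks this is 3 rounds and 12 messages, so the O(N) round count is
harmless at that scale.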

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
