On Aug 10, 2010, at 10:09 PM, Randolph Pullen wrote:

> Jeff thanks for the clarification,
> What I am trying to do is run N concurrent copies of a 1 to N data movement 
> program to effect an N to N solution.  The actual mechanism I am using is to 
> spawn N copies of mpirun from PVM across the cluster. So each 1 to N MPI 
> application starts at the same time with a different node as root.

You mentioned that each root has a large amount of data to broadcast.  How 
large?  

Have you done any back-of-the-envelope calculations to determine whether you're 
hitting link-contention limits -- e.g., would running a series of N/M 
broadcasts sequentially actually result in a net speedup (vs. running all N 
broadcasts simultaneously) simply because of reduced network congestion / 
contention?

If the messages are as large as you imply, then link contention has to be 
factored into your overall performance expectations, particularly if you're 
using more than just a handful of nodes.
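
To put rough numbers on it (picked purely for illustration): on gigabit 
ethernet each link moves roughly 125 MB/s.  If 16 roots each broadcast 1 GB 
simultaneously, every node has to pull in 15 GB of the other roots' data 
through its single link, so nothing can finish in much less than 
15 GB / 125 MB/s = ~120 seconds -- no matter how clever each individual 
broadcast algorithm is, and before counting any losses to congestion.  Running 
the broadcasts one at a time has roughly the same aggregate lower bound, so the 
real question is whether the simultaneous version degrades well below that 
bound because of drops and retransmits.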

> Yes I know this is a bit odd…  It was an attempt to be lazy and not re-write 
> the code (again) and this appears to be a potential log N solution.

I'm not sure I understand that statement -- why would this be a log(N) solution 
if everyone is broadcasting simultaneously (and therefore each root is 
presumably using most/all of the available send bandwidth on its link)?  A 
tree-based broadcast is log(N) in the number of hops, but with N roots 
transmitting at once, the total time is dominated by how much data each link 
has to carry, not by the hop count.

> My thoughts are that the problem must be either:
> 
> 1)    Some bug in my code that does not occur normally (this seems unlikely 
> because it halts in Bcast and runs in the normal 1 to N manner)
> 2)    Something in MPI is fouling the bcast call
> 3)    Something in PVM is fouling the bcast call
> 
> Obviously, this is not the PVM forum, but have I missed anything?

A fourth possibility is that the network is dropping something that it 
shouldn't be (with high link contention, this is possible).  You haven't 
mentioned it, but I'm assuming that you're running over ethernet -- perhaps you're 
running into TCP drops and therefore (very long) TCP retransmit timeouts.

If you want to remove PVM from the equation, you could mpirun a trivial 
bootstrap application across all your nodes in which each MPI_COMM_WORLD (MCW) 
rank calls MPI_COMM_SPAWN on MPI_COMM_SELF to launch the broadcast that is 
supposed to be rooted on that node.
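
Something along these lines would do it (an untested sketch -- "bcast_app" is a 
placeholder for your real broadcast program, and you'd have to decide how many 
children to spawn and where to place them, e.g. via the info argument's 
standard "host" key):

/* bootstrap.c -- untested sketch.  Each MCW rank spawns its own copy of the
 * broadcast application ("bcast_app" is a placeholder), so every node ends up
 * hosting the broadcast that is rooted there, with no PVM involved. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Comm children;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Spawning over MPI_COMM_SELF means each rank is the sole parent of its
       own set of children; root 0 is the only rank in SELF. */
    MPI_Comm_spawn("bcast_app", MPI_ARGV_NULL, size, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);

    /* ... coordinate with / wait for the children as needed ... */

    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}

The children of each spawn get their own MPI_COMM_WORLD, so the broadcast 
inside bcast_app can stay a plain 1-to-N MPI_BCAST.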

> BTW: Implementing Bcast with Multicast or a combination of both multicasts 
> and p2p transfers is another option and is described by Hoefler et al. in their 
> paper “A practically constant-time MPI Broadcast Algorithm for large-scale 
> InfiniBand Clusters with Multicast”.

Yep; I've read it.  Torsten's a smart guy.  :-)  I'd love to see a contributed 
plugin that implements this algorithm, or one of the other reliable multicast 
algorithms.

Keep in mind that if N roots (where N is large) are all transmitting very large 
multicast messages simultaneously, that is exactly the situation where networks 
are free to drop packets.  In a pathological case like yours, N simultaneous 
multicasts may not perform as well as you would expect.

> From here I need to decide to:
> 
> 1)    Generate a minimal example but given that this will require PVM, it is 
> unlikely to see much use.

I think if you can write a small MPI-only example, that would be most helpful.
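
Something as small as this would be enough (a sketch -- the 64 MB buffer size 
is arbitrary); launch N concurrent copies of it, one mpirun per root node, from 
a plain shell loop instead of PVM:

/* bcast_repro.c -- minimal 1-to-N broadcast; sketch only, buffer size is
 * arbitrary.  Run N concurrent copies (one mpirun per root node) to mimic the
 * failing N-to-N setup without PVM in the picture. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    const int len = 64 * 1024 * 1024;   /* 64 MB, arbitrary */
    char *buf = malloc(len);
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        memset(buf, 1, len);            /* root fills the payload */
    }

    t0 = MPI_Wtime();
    MPI_Bcast(buf, len, MPI_CHAR, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0) {
        printf("bcast of %d bytes took %f s\n", len, t1 - t0);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}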

> 2)    Write a N to N transfer system in MPI using inter-communicators, 
> however this may not scale with only p2p transfers and is probably N Log N at 
> best.

Intercommunicators are a red herring here.  They were mentioned earlier in the 
thread because people thought you were using the MPI accept/connect model of 
joining multiple MPI processes together.  If you aren't doing that, intercomms 
are likely unnecessary.

> 3)    Write the N to N transfer system in PVM, Open Fabric calls or something 
> that supports broadcast/multicast calls.

I'm not sure if OpenFabrics verbs support multicast.  Mellanox ConnectX cards 
were supposed to do this eventually, but I don't know if that capability ever 
was finished (Cisco left the IB business a while ago, so I've stopped paying 
attention to IB developments).

> My application must transfer a large (potentially huge) amount of tuples from 
> a table distributed across the cluster to a table replicated on each node.  
> The similar (1 to N) system compresses tuples into 64k pages and sends these. 
>  The same method would be used and the page size could be varied for 
> efficiency.
> 
> What are your thoughts?  Can OpenMPI do this in under N log N time?

(Open) MPI is just a message passing library -- in terms of raw bandwidth 
transfer, it can pretty much do anything that your underlying network can do.  
Whether MPI_BCAST or MPI_ALLGATHER is the right mechanism or not is a different 
issue.

(I'll say, though, that OMPI's ALLGATHER algorithm is probably not well 
optimized for massive data transfers like the ones you describe.)
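
For reference, though, the distributed-table -> replicated-table step maps 
pretty directly onto MPI_ALLGATHERV.  A rough sketch (every name here is made 
up and the tuple/page packing is hand-waved away; note that with int counts, 
each node's contribution per call is limited to a bit under 2 GB, so a huge 
table would need to be moved in several rounds):

/* Rough sketch: replicate a distributed table on every node.  "my_pages" and
 * "my_bytes" stand in for however the real application packs tuples into
 * pages; every name here is made up. */
#include <mpi.h>
#include <stdlib.h>

void replicate_table(char *my_pages, int my_bytes,
                     char **all_pages_out, int *total_bytes_out,
                     MPI_Comm comm)
{
    int i, size, total = 0;
    int *counts, *displs;
    char *all;

    MPI_Comm_size(comm, &size);
    counts = malloc(size * sizeof(int));
    displs = malloc(size * sizeof(int));

    /* First, everyone learns how much everyone else is contributing. */
    MPI_Allgather(&my_bytes, 1, MPI_INT, counts, 1, MPI_INT, comm);

    for (i = 0; i < size; i++) {
        displs[i] = total;
        total += counts[i];
    }
    all = malloc(total);

    /* One call moves every node's pages to every node. */
    MPI_Allgatherv(my_pages, my_bytes, MPI_BYTE,
                   all, counts, displs, MPI_BYTE, comm);

    *all_pages_out = all;
    *total_bytes_out = total;
    free(counts);
    free(displs);
}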

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

