Jeff, thanks for the clarification.
What I am trying to do is run N concurrent copies of a 1 to N data movement 
program to effect an N to N solution.  The actual mechanism I am using is to 
spawn N copies of mpirun from PVM across the cluster, so each 1 to N MPI 
application starts at the same time with a different node as root.
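
For reference, a rough sketch of that spawning mechanism, assuming PVM 3's 
pvm_spawn(); the host list, the mpirun arguments and the --root flag are 
hypothetical placeholders rather than my actual code:

    #include <stdio.h>
    #include "pvm3.h"

    int main(void)
    {
        /* Hypothetical host list; the real one comes from the cluster config. */
        char *hosts[] = { "node0", "node1", "node2" };
        int n = 3;

        pvm_mytid();                      /* enrol this process in the PVM */

        for (int i = 0; i < n; i++) {
            char rootarg[32];
            int  tid;

            /* Each mpirun starts the same 1 to N program with a different
             * node as root; "--root" is a placeholder application flag. */
            snprintf(rootarg, sizeof rootarg, "--root=%d", i);
            char *args[] = { "-np", "3", "./mover", rootarg, NULL };

            if (pvm_spawn("mpirun", args, PvmTaskHost, hosts[i], 1, &tid) != 1)
                fprintf(stderr, "spawn of mpirun on %s failed\n", hosts[i]);
        }

        pvm_exit();
        return 0;
    }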

Yes, I know this is a bit odd…  It was an attempt to be lazy and not rewrite 
the code (again), and it appears to offer a potential log N solution.

My thoughts are that the problem must be one of the following:

1)    Some bug in my code that does not occur normally (this seems unlikely 
because it halts in Bcast but runs correctly in the normal 1 to N manner)
2)    Something in MPI is fouling the Bcast call
3)    Something in PVM is fouling the Bcast call

Obviously, this is not the PVM forum, but have I missed anything?

I’m not a network expert, and I had assumed that broadcasts must be implemented 
with multicasts to prevent broadcasts from colliding between concurrent 
applications, thus conforming to the MPI safety/isolation dictum (this appears 
to be how PVM isolates broadcasts between applications).  I can see now that a 
series of point-to-point send/receives would not be affected in this way.

This is what I would describe as “implementing Bcast with point-to-point 
transfers”, as opposed to “implementing Bcast with broadcasts”.

BTW: Implementing Bcast with multicast, or with a combination of multicasts and 
point-to-point transfers, is another option, described by Hoefler et al. in 
their paper “A practically constant-time MPI Broadcast Algorithm for large-scale 
InfiniBand Clusters with Multicast”.

I guess I was alluding to the possibility that if MPI used actual broadcast 
calls to transmit or synchronise, then a broadcast collision might be possible.  
I did not know that only point-to-point transfers were used in Bcast calls.

From here I need to decide whether to:

1)    Generate a minimal example, but given that this will require PVM, it is 
unlikely to see much use.

2)    Write an N to N transfer system in MPI using inter-communicators; however, 
this may not scale with only point-to-point transfers and is probably N log N at 
best (a sketch of this option follows the list below).

3)    Write the N to N transfer system in PVM, OpenFabrics calls, or something 
else that supports broadcast/multicast calls.
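
As referenced in option 2 above, here is a minimal sketch of what a single-job 
N to N transfer could look like.  Note that it just loops MPI_Bcast over every 
root on MPI_COMM_WORLD rather than using inter-communicators, and PAGE_SIZE and 
the buffer names are placeholders, not my real code:

    #include <string.h>
    #include <mpi.h>

    #define PAGE_SIZE 65536              /* placeholder for the 64k pages */

    /* Each rank contributes one page; afterwards every rank holds all pages.
     * The collectives are issued in the same order on every rank, so no two
     * broadcasts ever overlap on the communicator. */
    void replicate_pages(const char *my_page, char *all_pages)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int root = 0; root < size; root++) {
            char *slot = all_pages + (size_t)root * PAGE_SIZE;
            if (root == rank)
                memcpy(slot, my_page, PAGE_SIZE);    /* my own partition */
            MPI_Bcast(slot, PAGE_SIZE, MPI_CHAR, root, MPI_COMM_WORLD);
        }
    }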

My application must transfer a large (potentially huge) number of tuples from a 
table distributed across the cluster to a table replicated on each node.  The 
existing (1 to N) system compresses tuples into 64k pages and sends these; the 
same method would be used here, and the page size could be varied for efficiency.
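
If each node's partition can be packed into one contiguous buffer, the 
distributed-table-to-replicated-table step could also be expressed as a single 
collective; a minimal sketch, assuming MPI_Allgatherv and hypothetical buffer 
names:

    #include <stdlib.h>
    #include <mpi.h>

    /* Gather every node's (differently sized) partition onto every node. */
    void replicate_table(char *local, int local_bytes,
                         char **table_out, int *total_out)
    {
        int size;
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *counts = malloc(size * sizeof(int));
        int *displs = malloc(size * sizeof(int));

        /* First everyone learns how many bytes each node contributes. */
        MPI_Allgather(&local_bytes, 1, MPI_INT, counts, 1, MPI_INT,
                      MPI_COMM_WORLD);

        int total = 0;
        for (int i = 0; i < size; i++) { displs[i] = total; total += counts[i]; }

        char *table = malloc(total);
        MPI_Allgatherv(local, local_bytes, MPI_CHAR,
                       table, counts, displs, MPI_CHAR, MPI_COMM_WORLD);

        *table_out = table;
        *total_out = total;
        free(counts);
        free(displs);
    }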

What are your thoughts?  Can Open MPI do this in under N log N time?

Regards,
Randolph


--- On Wed, 11/8/10, Jeff Squyres <jsquy...@cisco.com> wrote:

From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] MPI_Bcast issue
To: "Open MPI Users" <us...@open-mpi.org>
Received: Wednesday, 11 August, 2010, 6:24 AM

+1 on Eugene's comment that I don't fully understand what you are trying to do. 
 Can you send a short example code?

Some random points:

- Edgar already chimed in about how MPI-2 allows the use of intercommunicators 
with bcast. Open MPI is MPI-2.1 compliant, so you can use intercommunicators 
with MPI_Bcast.
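
For illustration, a minimal sketch of the MPI-2 intercommunicator Bcast 
semantics (the intercommunicator itself is assumed to come from 
MPI_Intercomm_create or MPI_Comm_connect/accept):

    #include <mpi.h>

    /* Broadcast a page from one group of an intercommunicator to the other.
     * In the sending group, the root passes MPI_ROOT and all other ranks pass
     * MPI_PROC_NULL; every rank in the receiving group passes the root's rank
     * within the remote (sending) group. */
    void bcast_across(MPI_Comm intercomm, int in_sender_group,
                      int sender_root, char *page, int len)
    {
        int my_rank;
        MPI_Comm_rank(intercomm, &my_rank);   /* rank within my local group */

        if (in_sender_group) {
            int root = (my_rank == sender_root) ? MPI_ROOT : MPI_PROC_NULL;
            MPI_Bcast(page, len, MPI_CHAR, root, intercomm);
        } else {
            MPI_Bcast(page, len, MPI_CHAR, sender_root, intercomm);
        }
    }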

- I'm not sure what you mean by "implement broadcast with broadcast" -- that 
seems like a recursive definition...?

- Keep in mind that there are MPI standard-imposed limitations of how MPI_Bcast 
can function.  What you *may* be running afoul of is MPI specifications and 
definitions -- not a bug in OMPI.  But that's not entirely clear to me because 
I don't quite understand what you're trying to do.

- For example, remember that MPI 2.x defines that you can only have one ongoing 
collective on a communicator at a time.  So if you're starting multiple bcasts 
on the same communicator simultaneously (effectively by using different root 
values in different processes on the same communicator), this is Bad.  Dick 
intoned that you probably aren't doing that, but again, I'm not entirely sure 
what you're doing.
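
One illustrative way to stay within that rule is to give each independent 
stream of broadcasts its own duplicate of the communicator, since collectives 
on distinct communicators never match each other; a sketch, not a prescription:

    #include <mpi.h>

    /* One duplicated communicator per broadcast stream (nstreams is whatever
     * the application needs).  MPI_Comm_dup is itself collective, so every
     * rank must execute this loop. */
    void make_stream_comms(int nstreams, MPI_Comm *stream_comm)
    {
        for (int s = 0; s < nstreams; s++)
            MPI_Comm_dup(MPI_COMM_WORLD, &stream_comm[s]);
    }

    /* A stream then only ever does, e.g.:
     *     MPI_Bcast(buf, len, MPI_CHAR, stream_root, stream_comm[s]);
     * Blocking collectives must still be issued in the same order on every
     * rank, but no two streams can interfere on the same communicator. */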

- Also, by the MPI spec, MPI_Bcast may or may not synchronize.  In practice, if 
you're broadcasting a large message, most implementations will likely 
synchronize (where "large" is defined differently by each implementation).

- Open MPI has many different algorithms to implement the MPI_Bcast 
functionality.  Which to use is chosen on the fly behind the scenes at run time 
depending on lots of things, such as number of peers in the communicator, size 
of the message, etc.

- All of Open MPI's network-based collectives use point-to-point communications 
underneath (shared memory may not, but that's not the issue here).  One of the 
implementations is linear, meaning that the root sends the message to comm rank 
1, then comm rank 2, ..etc.  But this algorithm is only used when the message 
is small, the number of peers is small, etc.  All the other algorithms are 
parallel in nature, meaning that after an iteration or two, multiple processes 
have the data and can start pipelining sends to other processes, etc.
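
Purely as an illustration of that point-to-point structure (not Open MPI's 
actual code), a binomial-tree broadcast looks roughly like this:

    #include <mpi.h>

    /* log(n) broadcast built only from point-to-point calls: each rank
     * receives the buffer from its parent in a binomial tree rooted at
     * 'root', then forwards it to its children. */
    void tree_bcast(void *buf, int count, MPI_Datatype type,
                    int root, MPI_Comm comm)
    {
        int rank, size, mask;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int vrank = (rank - root + size) % size;  /* renumber so root is 0 */

        /* Receive from the parent (flip my lowest set bit). */
        for (mask = 1; mask < size; mask <<= 1) {
            if (vrank & mask) {
                int parent = ((vrank ^ mask) + root) % size;
                MPI_Recv(buf, count, type, parent, 0, comm, MPI_STATUS_IGNORE);
                break;
            }
        }

        /* Forward to children: vrank + m for every power-of-two m below my
         * lowest set bit (for the root, all powers of two below size). */
        for (mask = 1; mask < size; mask <<= 1) {
            if (vrank & mask)
                break;
            if (vrank + mask < size) {
                int child = ((vrank + mask) + root) % size;
                MPI_Send(buf, count, type, child, 0, comm);
            }
        }
    }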

- We don't have a multicast-based broadcast for a variety of reasons.  
MPI_Bcast needs to be reliable.  Multicast is not reliable.  There have been 
many good algorithms published over the years to make unreliable multicast be 
reliable, but no one has implemented those in a robust, production-quality 
manner for Open MPI.  Part of the reason for that is the non-uniform support of 
robust multicast implementations by network vendors, the lack of spanning 
multicast across multiple subnets, etc.  In practice, the log(n) algorithms 
that Open MPI uses have generally been "fast enough" such that there hasn't 
been a clamor for a multicast-based broadcast.  To be fair: every once in a 
(great) while, someone says they need it, but to be totally blunt, a) we 
haven't received enough requests to implement it ourselves, or b) no one has 
contributed a patch / plugin that implements it.  That sounds snobby, but I 
don't mean it that way: what I mean is that most of the features in Open MPI 
are customer-driven.  All I'm saying is that we have 
a lot of other higher-priority customer-requested features that we're working 
on.  Multicast-bcast support is not high enough in priority because not enough 
people have asked for it.

I hope that helps...



On Aug 9, 2010, at 10:43 PM, Randolph Pullen wrote:

> The install was completely vanilla - no extras, just a plain ./configure 
> command line (on FC10 x86_64 Linux)
> 
> Are you saying that all broadcast calls are actually implemented as serial 
> point to point calls?
> 
> 
> --- On Tue, 10/8/10, Ralph Castain <r...@open-mpi.org> wrote:
> 
> From: Ralph Castain <r...@open-mpi.org>
> Subject: Re: [OMPI users] MPI_Bcast issue
> To: "Open MPI Users" <us...@open-mpi.org>
> Received: Tuesday, 10 August, 2010, 12:33 AM
> 
> No idea what is going on here. No MPI call is implemented as a multicast - it 
> all flows over the MPI pt-2-pt system via one of the various algorithms.
> 
> Best guess I can offer is that there is a race condition in your program that 
> you are tripping when other procs that share the node change the timing.
> 
> How did you configure OMPI when you built it?
> 
> 
> On Aug 8, 2010, at 11:02 PM, Randolph Pullen wrote:
> 
>> The only MPI calls I am using are these (grep-ed from my code):
>> 
>> MPI_Abort(MPI_COMM_WORLD, 1);
>> MPI_Barrier(MPI_COMM_WORLD);
>> MPI_Bcast(&bufarray[0].hdr, sizeof(BD_CHDR), MPI_CHAR, 0, MPI_COMM_WORLD);
>> MPI_Comm_rank(MPI_COMM_WORLD,&myid);
>> MPI_Comm_size(MPI_COMM_WORLD,&numprocs); 
>> MPI_Finalize();
>> MPI_Init(&argc, &argv);
>> MPI_Irecv(
>> MPI_Isend(
>> MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
>> MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
>> MPI_Test(&request, &complete, &status);
>> MPI_Wait(&request, &status);  
>> 
>> The big wait happens on receipt of a bcast call that would otherwise work.
>> It's a bit mysterious really...
>> 
>> I presume that bcast is implemented with multicast calls but does it use any 
>> actual broadcast calls at all?  
>> I know I'm scraping the edges here looking for something but I just can't get 
>> my head around why it should fail where it has.
>> 
>> --- On Mon, 9/8/10, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>> From: Ralph Castain <r...@open-mpi.org>
>> Subject: Re: [OMPI users] MPI_Bcast issue
>> To: "Open MPI Users" <us...@open-mpi.org>
>> Received: Monday, 9 August, 2010, 1:32 PM
>> 
>> Hi Randolph
>> 
>> Unless your code is doing a connect/accept between the copies, there is no 
>> way they can cross-communicate. As you note, mpirun instances are completely 
>> isolated from each other - no process in one instance can possibly receive 
>> information from a process in another instance because it lacks all 
>> knowledge of it -unless- they wireup into a greater communicator by 
>> performing connect/accept calls between them.
>> 
>> I suspect you are inadvertently doing just that - perhaps by doing 
>> connect/accept in a tree-like manner, not realizing that the end result is 
>> one giant communicator that now links together all the N servers.
>> 
>> Otherwise, there is no possible way an MPI_Bcast in one mpirun can collide 
>> or otherwise communicate with an MPI_Bcast between processes started by 
>> another mpirun.
>> 
>> 
>> 
>> On Aug 8, 2010, at 7:13 PM, Randolph Pullen wrote:
>> 
>>> Thanks.  Although “An intercommunicator cannot be used for collective 
>>> communication” (i.e. Bcast calls), I can see how the MPI_Group_xx calls 
>>> can be used to produce a useful group and then a communicator - thanks 
>>> again, but this is really a side issue to my main question about MPI_Bcast.
>>> 
>>> I seem to have duplicate concurrent processes interfering with each other.  
>>> This would appear to be a breach of the MPI safety dictum, i.e. MPI_COMM_WORLD 
>>> is supposed to include only the processes started by a single mpirun 
>>> command and isolate these processes safely from other similar groups of 
>>> processes.
>>> 
>>> So, it would appear to be a bug.  If so, this has significant implications 
>>> for environments such as mine, where the same program may often be run by 
>>> different users simultaneously.
>>> 
>>> It is really this issue that is concerning me; I can rewrite the code, but 
>>> if it can crash when 2 copies run at the same time, I have a much bigger 
>>> problem.
>>> 
>>> My suspicion is that within the MPI_Bcast handshaking, a synchronising 
>>> broadcast call may be colliding across the environments.  My only evidence 
>>> is that an otherwise working program waits on broadcast reception forever 
>>> when two or more copies are run at [exactly] the same time.
>>> 
>>> Has anyone else seen similar behavior in concurrently running programs that 
>>> perform lots of broadcasts perhaps?
>>> 
>>> Randolph
>>> 
>>> 
>>> --- On Sun, 8/8/10, David Zhang <solarbik...@gmail.com> wrote:
>>> 
>>> From: David Zhang <solarbik...@gmail.com>
>>> Subject: Re: [OMPI users] MPI_Bcast issue
>>> To: "Open MPI Users" <us...@open-mpi.org>
>>> Received: Sunday, 8 August, 2010, 12:34 PM
>>> 
>>> In particular, intercommunicators
>>> 
>>> On 8/7/10, Aurélien Bouteiller <boute...@eecs.utk.edu> wrote:
>>> > You should consider reading about communicators in MPI.
>>> >
>>> > Aurelien
>>> > --
>>> > Aurelien Bouteiller, Ph.D.
>>> > Innovative Computing Laboratory, The University of Tennessee.
>>> >
>>> > Sent from my iPad
>>> >
>>> > Le Aug 7, 2010 à 1:05, Randolph Pullen <randolph_pul...@yahoo.com.au> a
>>> > écrit :
>>> >
>>> >> I seem to be having a problem with MPI_Bcast.
>>> >> My massive I/O intensive data movement program must broadcast from n to n
>>> >> nodes. My problem starts because I require 2 processes per node, a sender
>>> >> and a receiver and I have implemented these using MPI processes rather
>>> >> than tackle the complexities of threads on MPI.
>>> >>
>>> >> Consequently, broadcast and calls like alltoall are not completely
>>> >> helpful.  The dataset is huge and each node must end up with a complete
>>> >> copy built by the large number of contributing broadcasts from the 
>>> >> sending
>>> >> nodes.  Network efficiency and run time are paramount.
>>> >>
>>> >> As I don’t want to needlessly broadcast all this data to the sending 
>>> >> nodes
>>> >> and I have a perfectly good MPI program that distributes globally from a
>>> >> single node (1 to N), I took the unusual decision to start N copies of
>>> >> this program by spawning the MPI system from the PVM system in an effort
>>> >> to get my N to N concurrent transfers.
>>> >>
>>> >> It seems that the broadcasts running on concurrent MPI environments
>>> >> collide and cause all but the first process to hang waiting for their
>>> >> broadcasts.  This theory seems to be confirmed by introducing a sleep of
>>> >> n-1 seconds before the first MPI_Bcast  call on each node, which results
>>> >> in the code working perfectly.  (total run time 55 seconds, 3 nodes,
>>> >> standard TCP stack)
>>> >>
>>> >> My guess is that unlike PVM, OpenMPI implements broadcasts with 
>>> >> broadcasts
>>> >> rather than multicasts.  Can someone confirm this?  Is this a bug?
>>> >>
>>> >> Is there any multicast or N to N broadcast where sender processes can
>>> >> avoid participating when they don’t need to?
>>> >>
>>> >> Thanks in advance
>>> >> Randolph
>>> >>
>>> >>
>>> >>
>>> >
>>> 
>>> -- 
>>> Sent from my mobile device
>>> 
>>> David Zhang
>>> University of California, San Diego
>>> 
>> 
>> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

