Hello boys and girls.  I just wanted to drop a line and give you an update.

First of all, my simple question:
In what files can I find the source code for "mca_oob.oob_send" and "mca_oob.oob_recv"? I'm having a hard time following the initialization code that populates the struct of callbacks.

Next, the context of the question:
I've been trying to find a way to make a plain old process start and then participate in an MPI Group spread across a cluster. Let me try to use the local dialect and express my goal in terms I am likely to misuse: I want to make a singleton MPI process spawn and establish an intercommunicator with another MPI world.

Here's the list of things that have not worked:

Using MPI_Comm_spawn -- I've been told this works in the 1.3 CVS snapshots, but not in any stable release. The symptom is that the call to MPI_Comm_spawn complains about not having a hostfile. For the full history, see the ompi-users thread "How to specify hosts for MPI_Comm_spawn".
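
For reference, here is roughly the shape of the spawn attempt (a minimal sketch, not my actual program; the child binary name "worker" and the process count are placeholders):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm intercomm;
    int      errcodes[4];

    MPI_Init(&argc, &argv);   /* started by hand, so this becomes a singleton */

    /* Ask the runtime to launch 4 copies of "worker" and hand back an
     * intercommunicator to them.  This is the call that aborts with the
     * complaint about a missing hostfile. */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &intercomm, errcodes);

    printf("spawn returned\n");
    MPI_Comm_free(&intercomm);
    MPI_Finalize();
    return 0;
}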

Forking the parent process *before* it enters any MPI calls (to hopefully avoid the environmental pitfalls Jeff Squyres warned of). The parent process calls MPI_Init to become the MPI singleton, then tries to establish an intercommunicator with the MPI group that is being spawned at the same time. The forked child overlays itself with mpirun via execlp to start a "normal" MPI group. I've tried two different methods for establishing the intercomm. Both methods hang indefinitely and use lots of CPU while doing nothing.

Fork Method 1: MPI_Open_port + MPI_Comm_accept on one side, MPI_Comm_connect on the other. The two sides hang in MPI_Comm_accept and MPI_Comm_connect respectively. I did not pursue it deeper than that.
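
To make Fork Method 1 concrete, here is a stripped-down sketch of the parent side (the "connector" program name, the -np count, and printing the port name are placeholders; in the real code the port string has to reach the connecting side by some out-of-band means):

#include <mpi.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    pid_t pid = fork();                     /* fork *before* any MPI call */
    if (pid == 0) {
        /* Child: overlay ourselves with mpirun to launch the "normal"
         * MPI group that will connect back to the singleton. */
        execlp("mpirun", "mpirun", "-np", "2", "./connector", (char *)NULL);
        perror("execlp");
        _exit(1);
    }

    /* Parent: become the MPI singleton and wait for the connection. */
    char     port[MPI_MAX_PORT_NAME];
    MPI_Comm intercomm;

    MPI_Init(&argc, &argv);
    MPI_Open_port(MPI_INFO_NULL, port);
    fprintf(stderr, "accepting on port: %s\n", port);

    /* The connector side does the mirror image with the same port string:
     *     MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &icomm);
     * Both sides hang here, spinning on the CPU. */
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);

    MPI_Comm_free(&intercomm);
    MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}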

Fork Method 2: TCP socket establishment, followed by MPI_Comm_join on both sides. Both sides hang in the call to MPI_Comm_join. Upon further inspection and code-hacking, I've determined that they successfully trade the names "0.0.0" and "0.1.0", and both sides then call ompi_comm_connect_accept. Inside ompi_comm_connect_accept, both sides call orte_rml.send_buffer; one side finishes the call, while the other gets blocked inside oob_send. The side that did not get blocked moves on to call orte_rml.recv_buffer and gets blocked inside oob_recv.
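
And a corresponding sketch for Fork Method 2, accept side only (the port number 5000 and the bare-bones socket handling are placeholders; the connecting side does socket()/connect() to the same port and then the same MPI_Comm_join call):

#include <mpi.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* Plain TCP listen/accept to get a connected socket. */
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(5000);    /* placeholder port */
    bind(lsock, (struct sockaddr *)&addr, sizeof(addr));
    listen(lsock, 1);
    int fd = accept(lsock, NULL, NULL);

    /* Both sides call MPI_Comm_join on their end of the socket; this is
     * where both processes hang, down inside oob_send / oob_recv. */
    MPI_Comm intercomm;
    MPI_Comm_join(fd, &intercomm);

    MPI_Comm_free(&intercomm);
    close(fd);
    close(lsock);
    MPI_Finalize();
    return 0;
}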


OOB == out-of-band sockets? If so, why?
