Reopening this thread. While looking into another problem I ran across this
one in a different context. It turns out there really is a bug here that needs
to be addressed.

I'll try to tackle it this weekend - will update you when done.


On Jun 25, 2010, at 7:23 AM, Philippe wrote:

> Hi,
> 
> I'm trying to run a test program which consists of a server creating a
> port using MPI_Open_port and N clients using MPI_Comm_connect to
> connect to the server.
> 
> I'm able to do so with 1 server and 2 clients, but with 1 server + 3
> clients, I get the following error message:
> 
>   [node003:32274] [[37084,0],0]:route_callback tried routing message
> from [[37084,1],0] to [[40912,1],0]:102, can't find route
> 
> This is only happening with the openib BTL. With tcp BTL it works
> perfectly fine (ofud also works as a matter of fact...). This has been
> tested on two completely different clusters, with identical results.
> In both cases, the IB fabric works normally.
> 
> Any help would be greatly appreciated! Several people in my team
> looked at the problem. Google and the mailing list archive did not
> provide any clue. I believe that from an MPI standpoint, my test
> program is valid (and it works with TCP, which makes me feel better
> about the sequence of MPI calls).
> 
> Regards,
> Philippe.
> 
> 
> 
> Background:
> 
> I intend to use openMPI to transport data inside a much larger
> application. Because of that, I cannot use mpiexec. Each process is
> started by our own "job management" and uses a name server to find
> the others. Once all the clients are connected, I would like the
> server to do MPI_Recv to get the data from all the clients. I don't
> care about the order or which clients are sending data, as long as I
> can receive it with one call. To do that, the clients and the server
> go through a series of Comm_accept/Comm_connect/Intercomm_merge
> so that at the end, all the clients and the server are inside the same
> intracomm.
> 
> Steps:
> 
> I have a sample program that shows the issue. I tried to make it as
> short as possible. It needs to be executed on a shared file system
> like NFS because the server writes the port info to a file that the
> clients will read. To reproduce the issue, the following steps should
> be performed:
> 
> 0. compile the test with "mpicc -o ben12 ben12.c"
> 1. ssh to the machine that will be the server
> 2. run ./ben12 3 1
> 3. ssh to the machine that will be the client #1
> 4. run ./ben12 3 0
> 5. repeat steps 3-4 for client #2 and #3
> 
> The server accepts the connection from client #1 and merges it into a
> new intracomm. It then accepts the connection from client #2 and merges
> it. When client #3 arrives, the server accepts the connection, but that
> causes clients #1 and #2 to die with the error above (see the complete
> trace in the tarball).
> 
> The exact steps are:
> 
>     - server opens port
>     - server does accept
>     - client #1 does connect
>     - server and client #1 do merge
>     - server does accept
>     - client #2 does connect
>     - server, client #1 and client #2 do merge
>     - server does accept
>     - client #3 does connect
>     - server, client #1, client #2 and client #3 do merge
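For anyone following the sequence above, here is a small sketch in plain C
(no MPI calls) that models who takes part in each round. It assumes, going by
the merge lines in the list, that every process already in the intracomm joins
the next accept alongside the server. ben12.c is not reproduced here, so the
function names below (accept_group_size, merged_size, accepts_joined,
print_schedule) are made up for illustration and are not taken from the test
program.

```c
#include <assert.h>
#include <stdio.h>

/* Toy model only: rank 0 is the server, clients are numbered 1..n in
 * arrival order. Nothing here calls MPI. */

/* Processes that take part in the accept for client k (1-based):
 * the server plus clients 1..k-1, i.e. the current intracomm.
 * (Assumption: the accept is collective over that intracomm.) */
static int accept_group_size(int k)
{
    assert(k >= 1);
    return k;
}

/* Size of the intracomm once client k has been merged in. */
static int merged_size(int k)
{
    assert(k >= 1);
    return k + 1;
}

/* How many accept rounds a given process joins in total. The server
 * (rank 0) joins all n; client r joins the rounds for clients r+1..n. */
static int accepts_joined(int rank, int n_clients)
{
    assert(rank >= 0 && rank <= n_clients);
    return rank == 0 ? n_clients : n_clients - rank;
}

/* Print one line per round of the accept/connect/merge sequence. */
static void print_schedule(int n_clients)
{
    for (int k = 1; k <= n_clients; k++) {
        printf("round %d: %d process(es) in accept, client #%d connects, "
               "merge gives intracomm of size %d\n",
               k, accept_group_size(k), k, merged_size(k));
    }
}
```

Calling print_schedule(3) walks the three rounds listed above. In this model,
round 3 is the first one where two already-merged clients (#1 and #2) sit in
the accept, which matches the point where the failure is reported.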
> 
> 
> My infiniband network works normally with other test programs or
> applications (MPI or others like Verbs).
> 
> Info about my setup:
> 
>    openMPI version = 1.4.1 (I also tried 1.4.2, nightly snapshot of
> 1.4.3, nightly snapshot of 1.5 --- all show the same error)
>    config.log in the tarball
>    "ompi_info --all" in the tarball
>    OFED version = 1.3 installed from RHEL 5.3
>    Distro = Red Hat Enterprise Linux 5.3
>    Kernel = 2.6.18-128.4.1.el5 x86_64
>    subnet manager = built-in SM from the cisco/topspin switch
>    output of ibv_devinfo included in the tarball (there are no "bad" nodes)
>    "ulimit -l" says "unlimited"
> 
> The tarball contains:
> 
>   - ben12.c: my test program showing the behavior
>   - config.log / config.out / make.out / make-install.out /
> ifconfig.txt / ibv-devinfo.txt / ompi_info.txt
>   - trace-tcp.txt: output of the server and each client when it works
> with TCP (I added "btl = tcp,self" in ~/.openmpi/mca-params.conf)
>   - trace-ib.txt: output of the server and each client when it fails
> with IB (I added "btl = openib,self" in ~/.openmpi/mca-params.conf)
> 
> I hope I provided enough info for somebody to reproduce the problem...
> <ompi-output.tar.bz2>

