Hi, I'm trying to run a test program which consists of a server creating a port using MPI_Open_port and N clients using MPI_Comm_connect to connect to the server.
I'm able to do so with 1 server and 2 clients, but with 1 server + 3 clients, I get the following error message:

  [node003:32274] [[37084,0],0]:route_callback tried routing message from [[37084,1],0] to [[40912,1],0]:102, can't find route

This only happens with the openib BTL. With the tcp BTL it works perfectly fine (ofud also works, as a matter of fact...). This has been tested on two completely different clusters, with identical results. In both cases, the IB fabric works normally.

Any help would be greatly appreciated! Several people in my team looked at the problem. Google and the mailing list archive did not provide any clue. I believe that from an MPI standpoint, my test program is valid (and it works with TCP, which makes me feel better about the sequence of MPI calls).

Regards,
Philippe.

Background:

I intend to use Open MPI to transport data inside a much larger application. Because of that, I cannot use mpiexec. Each process is started by our own "job management" and uses a name server to find the others. Once all the clients are connected, I would like the server to do MPI_Recv to get the data from all the clients. I don't care about the order or which clients are sending data, as long as I can receive it with one call. To do that, the clients and the server go through a series of Comm_accept/Comm_connect/Intercomm_merge calls so that at the end, all the clients and the server are inside the same intracomm.

Steps:

I have a sample program that shows the issue. I tried to make it as short as possible. It needs to be executed on a shared file system like NFS because the server writes the port info to a file that the clients will read. To reproduce the issue, the following steps should be performed:

0. compile the test with "mpicc -o ben12 ben12.c"
1. ssh to the machine that will be the server
2. run ./ben12 3 1
3. ssh to the machine that will be client #1
4. run ./ben12 3 0
5. repeat steps 3-4 for client #2 and #3

The server accepts the connection from client #1 and merges it into a new intracomm. It then accepts the connection from client #2 and merges it. When client #3 arrives, the server accepts the connection, but that causes client #1 and #2 to die with the error above (see the complete trace in the tarball).

The exact steps are:

- server opens port
- server does accept
- client #1 does connect
- server and client #1 do merge
- server does accept
- client #2 does connect
- server, client #1 and client #2 do merge
- server does accept
- client #3 does connect
- server, client #1, client #2 and client #3 do merge

My InfiniBand network works normally with other test programs or applications (MPI or others like Verbs).

Info about my setup:

- Open MPI version = 1.4.1 (I also tried 1.4.2, a nightly snapshot of 1.4.3 and a nightly snapshot of 1.5; all show the same error)
- config.log: in the tarball
- "ompi_info --all": in the tarball
- OFED version = 1.3, installed from RHEL 5.3
- Distro = Red Hat Enterprise Linux 5.3
- Kernel = 2.6.18-128.4.1.el5 x86_64
- subnet manager = built-in SM from the cisco/topspin switch
- output of ibv_devinfo: included in the tarball (there are no "bad" nodes)
- "ulimit -l" says "unlimited"

The tarball contains:

- ben12.c: my test program showing the behavior
- config.log / config.out / make.out / make-install.out / ifconfig.txt / ibv-devinfo.txt / ompi_info.txt
- trace-tcp.txt: output of the server and each client when it works with TCP (I added "btl = tcp,self" in ~/.openmpi/mca-params.conf)
- trace-ib.txt: output of the server and each client when it fails with IB (I added "btl = openib,self" in ~/.openmpi/mca-params.conf)

I hope I provided enough info for somebody to reproduce the problem...
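
The port exchange over the shared file system can be sketched roughly as follows. This is only a minimal sketch and not the code from ben12.c: the helper names write_port_file and read_port_file are hypothetical, and so is whatever file path gets passed to them. It assumes the server writes the port string once, before any client is started, as in the steps above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from ben12.c).
 * Server side: write the port string to a file on the shared file system.
 * Returns 0 on success, -1 on failure. */
int write_port_file(const char *path, const char *port_name)
{
    assert(path != NULL && port_name != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = (fputs(port_name, f) != EOF);
    if (fclose(f) != 0)   /* fclose flushes; a failure here means lost data */
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical helper (not from ben12.c).
 * Client side: read the port string back into buf (at most len - 1 chars).
 * Returns 0 on success, -1 if the file is missing or empty. */
int read_port_file(const char *path, char *buf, size_t len)
{
    assert(path != NULL && buf != NULL && len > 0);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    char *got = fgets(buf, (int)len, f);
    fclose(f);
    if (got == NULL)
        return -1;

    buf[strcspn(buf, "\r\n")] = '\0';   /* drop a trailing newline, if any */
    return 0;
}
```

In a real program, the server would pass the string filled in by MPI_Open_port to write_port_file, and each client would pass the string it read to MPI_Comm_connect. The buffer on both sides should be at least MPI_MAX_PORT_NAME bytes.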
ompi-output.tar.bz2
Description: BZip2 compressed data