Hi OpenMPI_Users and OpenMPI_Developers,

I'm unable to connect a client application using MPI_Comm_connect() to a server job (the server job calls MPI_Open_port() before calling MPI_Comm_accept()) when the server job uses the MX MTL, although it works without problems when the server uses the MX BTL.

The server job runs on a cluster connected to a Myrinet 10G network (MX 1.2.11) in addition to an ordinary Ethernet network. The client runs on a different machine, which is not connected to the Myrinet network but is accessible via the Ethernet network.
Attached to this message are the simple server and client programs (87 lines total), called simpleserver.c and simpleclient.c. Note that we are using OpenMPI 1.4.2 on x86_64 Linux (server: Fedora 7, client: Fedora 12).

Compiling these programs with mpicc on the server front node (fn1) and on the client workstation (linux15) works well:

[audet@fn1 bench]$ mpicc simpleserver.c -o simpleserver

[audet@linux15 mpi]$ mpicc simpleclient.c -o simpleclient

Then we start the server on the cluster (the job is started on cluster node cn18) and ask it to use the MTL:

[audet@fn1 bench]$ mpiexec -x MX_RCACHE=2 -machinefile machinefile_cn18 --mca mtl mx --mca pml cm -n 1 ./simpleserver

It prints the server port (note that we use MX_RCACHE=2 to avoid a warning, but it doesn't affect the current issue):

Server port = '3548905472.0;tcp://172.17.15.20:39517+3548905473.0;tcp://172.17.10.18:47427:300'

Then we start the client on the workstation with this port string:

[audet@linux15 mpi]$ mpiexec -n 1 ./simpleclient '3548905472.0;tcp://172.17.15.20:39517+3548905473.0;tcp://172.17.10.18:47427:300'

The server process dumps core as follows:

MPI_Comm_accept() sucessful...
[cn18:24582] *** Process received signal ***
[cn18:24582] Signal: Segmentation fault (11)
[cn18:24582] Signal code: Address not mapped (1)
[cn18:24582] Failing at address: 0x38
[cn18:24582] [ 0] /lib64/libpthread.so.0 [0x305de0dd20]
[cn18:24582] [ 1] /usr/local/openmpi-1.4.2/lib/openmpi/mca_mtl_mx.so [0x2aaaad6a7e6d]
[cn18:24582] [ 2] /usr/local/openmpi-1.4.2/lib/openmpi/mca_pml_cm.so [0x2aaaad4a319d]
[cn18:24582] [ 3] /usr/local/openmpi/lib/libmpi.so.0(ompi_dpm_base_disconnect_init+0xbf) [0x2aaaaab1403f]
[cn18:24582] [ 4] /usr/local/openmpi-1.4.2/lib/openmpi/mca_dpm_orte.so [0x2aaaaed0eb19]
[cn18:24582] [ 5] /usr/local/openmpi/lib/libmpi.so.0(PMPI_Comm_disconnect+0xa0) [0x2aaaaaaf4f20]
[cn18:24582] [ 6] ./simpleserver(main+0x14c) [0x400d04]
[cn18:24582] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x305ce1daa4]
[cn18:24582] [ 8] ./simpleserver [0x400b09]
[cn18:24582] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 24582 on node cn18 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[audet@fn1 bench]$

And the client stops with the following error message:

--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[31386,1],0]) is on host: linux15
  Process 2 ([[54152,1],0]) is on host: cn18
  BTLs attempted: self sm tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
MPI_Comm_connect() sucessful...
Error in comm_disconnect_waitall
[audet@linux15 mpi]$

I really don't understand this message, because the client can connect to the server using TCP over Ethernet.

Moreover, if I add MCA options when I start the server to include the TCP BTL, the same problem happens (the argument list then becomes '--mca mtl mx --mca pml cm --mca btl tcp,shared,self').

However, if I remove all MCA options when I start the server (i.e. when the MX BTL is used), no such problem appears. Everything also goes fine if I start the server with an explicit request to use the MX and TCP BTLs (e.g. with the options '--mca btl mx,tcp,sm,self').

For running our server application we really prefer to use the MX MTL over the MX BTL, since the application is much faster with the MTL (although the usual ping-pong test is only slightly faster with the MTL).

Also enclosed is the output of ompi_info --all, run on the cluster node (cn18) and on the workstation (linux15).

Please help me. I think my problem is only a question of wrong MCA parameters (which are obscure to me).

Thanks,

Martin Audet, Research Officer
Industrial Material Institute
National Research Council of Canada
75 de Mortagne, Boucherville, QC, J4B 6Y4, Canada
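
P.S. In case it helps to check which addresses the client workstation must be able to reach, here is a minimal sketch that pulls the TCP endpoints out of the server port string printed above. The function name nth_tcp_endpoint is hypothetical and not part of Open MPI, and the only thing it assumes about the string is that each endpoint looks like tcp://<host>:<port>. That layout is inferred from the string shown above, not from any documented format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not an Open MPI function): copy the n-th
 * "tcp://<host>:<port>" endpoint found in a port string into out.
 * ASSUMPTION: endpoints have the form tcp://<host>:<port>, as seen in
 * the string printed by simpleserver; anything after the port digits
 * (such as the trailing ":300") is ignored.
 * Returns 0 on success, -1 if there is no n-th endpoint or out is too small.
 */
int nth_tcp_endpoint(const char *port, int n, char *out, size_t outlen)
{
    static const char scheme[] = "tcp://";
    const size_t slen = sizeof scheme - 1;
    const char *p = port;

    assert(port != NULL && out != NULL);
    if (n < 0)
        return -1;

    /* Walk forward to the n-th occurrence of "tcp://". */
    for (int i = 0; ; i++) {
        p = strstr(p, scheme);
        if (p == NULL)
            return -1;
        if (i == n)
            break;
        p += slen;
    }

    /* The host runs up to the next ':'; the port is the digits after it. */
    size_t host = strcspn(p + slen, ":;+");
    const char *colon = p + slen + host;
    if (*colon != ':')
        return -1;
    size_t digits = strspn(colon + 1, "0123456789");
    if (digits == 0)
        return -1;

    size_t len = slen + host + 1 + digits;
    if (len + 1 > outlen)
        return -1;
    memcpy(out, p, len);
    out[len] = '\0';
    return 0;
}
```

Calling it with n = 0 and n = 1 on the port string above yields tcp://172.17.15.20:39517 and tcp://172.17.10.18:47427, so both of these addresses would need to be reachable from linux15 over Ethernet.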