Hi OpenMPI_Users and OpenMPI_Developers,

I'm unable to connect a client application using MPI_Comm_connect() to a server 
job (the server job calls MPI_Open_port() followed by MPI_Comm_accept()) 
when the server job uses the MX MTL (it works without problems when the 
server uses the MX BTL). The server job runs on a cluster connected to a Myrinet 
10G network (MX 1.2.11) in addition to an ordinary Ethernet network. The client 
runs on a different machine, not connected to the Myrinet network but 
reachable via the Ethernet network.

Attached to this message are the simple server and client programs (87 lines 
total), called simpleserver.c and simpleclient.c .
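
For reference, the two programs follow the usual dynamic-process pattern 
sketched below (the attached files are authoritative; the integer payload and 
the exact messages here are placeholders of mine):

   /* server side (cf. simpleserver.c) -- minimal sketch */
   #include <mpi.h>
   #include <stdio.h>

   int main(int argc, char **argv)
   {
       char port[MPI_MAX_PORT_NAME];
       MPI_Comm client;
       int data = 0;

       MPI_Init(&argc, &argv);
       MPI_Open_port(MPI_INFO_NULL, port);    /* obtain the port string */
       printf("Server port = '%s'\n", port);
       fflush(stdout);
       MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
       MPI_Recv(&data, 1, MPI_INT, 0, 0, client, MPI_STATUS_IGNORE);
       MPI_Comm_disconnect(&client);  /* the backtrace below points here */
       MPI_Close_port(port);
       MPI_Finalize();
       return 0;
   }

   /* client side (cf. simpleclient.c) -- minimal sketch */
   #include <mpi.h>
   #include <stdio.h>

   int main(int argc, char **argv)
   {
       MPI_Comm server;
       int data = 42;

       MPI_Init(&argc, &argv);
       /* argv[1] is the port string printed by the server */
       MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
       MPI_Send(&data, 1, MPI_INT, 0, 0, server);
       MPI_Comm_disconnect(&server);
       MPI_Finalize();
       return 0;
   }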

Note we are using Open MPI 1.4.2 on x86_64 Linux (server: Fedora 7, client: 
Fedora 12).

Compiling these programs with mpicc on the server front node (fn1) and the 
client workstation (linux15) works well:

   [audet@fn1 bench]$ mpicc simpleserver.c -o simpleserver

   [audet@linux15 mpi]$ mpicc simpleclient.c -o simpleclient

Then we start the server on the cluster (the job is started on cluster node 
cn18), asking it to use the MX MTL:

   [audet@fn1 bench]$ mpiexec -x MX_RCACHE=2 -machinefile machinefile_cn18 --mca mtl mx --mca pml cm -n 1 ./simpleserver

It prints the server port (note: we use MX_RCACHE=2 to avoid a warning; it 
doesn't affect the current issue):

   Server port = '3548905472.0;tcp://172.17.15.20:39517+3548905473.0;tcp://172.17.10.18:47427:300'

Then we start the client on the workstation with this port string:

   [audet@linux15 mpi]$ mpiexec -n 1 ./simpleclient '3548905472.0;tcp://172.17.15.20:39517+3548905473.0;tcp://172.17.10.18:47427:300'

The server process core dumps as follows:

   MPI_Comm_accept() sucessful...
   [cn18:24582] *** Process received signal ***
   [cn18:24582] Signal: Segmentation fault (11)
   [cn18:24582] Signal code: Address not mapped (1)
   [cn18:24582] Failing at address: 0x38
   [cn18:24582] [ 0] /lib64/libpthread.so.0 [0x305de0dd20]
   [cn18:24582] [ 1] /usr/local/openmpi-1.4.2/lib/openmpi/mca_mtl_mx.so [0x2aaaad6a7e6d]
   [cn18:24582] [ 2] /usr/local/openmpi-1.4.2/lib/openmpi/mca_pml_cm.so [0x2aaaad4a319d]
   [cn18:24582] [ 3] /usr/local/openmpi/lib/libmpi.so.0(ompi_dpm_base_disconnect_init+0xbf) [0x2aaaaab1403f]
   [cn18:24582] [ 4] /usr/local/openmpi-1.4.2/lib/openmpi/mca_dpm_orte.so [0x2aaaaed0eb19]
   [cn18:24582] [ 5] /usr/local/openmpi/lib/libmpi.so.0(PMPI_Comm_disconnect+0xa0) [0x2aaaaaaf4f20]
   [cn18:24582] [ 6] ./simpleserver(main+0x14c) [0x400d04]
   [cn18:24582] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x305ce1daa4]
   [cn18:24582] [ 8] ./simpleserver [0x400b09]
   [cn18:24582] *** End of error message ***
   --------------------------------------------------------------------------
   mpiexec noticed that process rank 0 with PID 24582 on node cn18 exited on signal 11 (Segmentation fault).
   --------------------------------------------------------------------------
   [audet@fn1 bench]$

And the client stops with the following error message:

   --------------------------------------------------------------------------
   At least one pair of MPI processes are unable to reach each other for
   MPI communications.  This means that no Open MPI device has indicated
   that it can be used to communicate between these processes.  This is
   an error; Open MPI requires that all MPI processes be able to reach
   each other.  This error can sometimes be the result of forgetting to
   specify the "self" BTL.

     Process 1 ([[31386,1],0]) is on host: linux15
     Process 2 ([[54152,1],0]) is on host: cn18
     BTLs attempted: self sm tcp

   Your MPI job is now going to abort; sorry.
   --------------------------------------------------------------------------
   MPI_Comm_connect() sucessful...
   Error in comm_disconnect_waitall
   [audet@linux15 mpi]$

I really don't understand this message, because the client can connect to the 
server using TCP over Ethernet.

Moreover, if I add MCA options when I start the server so as to include the 
TCP BTL, the same problem happens (the argument list then becomes: '--mca mtl 
mx --mca pml cm --mca btl tcp,shared,self').

However, if I remove all MCA options when I start the server (i.e. when the MX 
BTL is used), no such problem appears. Everything also goes fine if I start the 
server with an explicit request to use the MX and TCP BTLs (e.g. with options 
'--mca btl mx,tcp,sm,self').
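
If it helps to diagnose which components actually get selected in each case, I 
can rerun the server with selection verbosity turned up, along these lines 
(assuming the usual pml_base_verbose / btl_base_verbose MCA parameters behave 
the same under 1.4.2):

   [audet@fn1 bench]$ mpiexec -x MX_RCACHE=2 -machinefile machinefile_cn18 --mca mtl mx --mca pml cm --mca pml_base_verbose 10 --mca btl_base_verbose 10 -n 1 ./simpleserver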

For running our server application we really prefer the MX MTL over the MX BTL, 
since the application is much faster with the MTL (although the usual ping-pong 
test is only slightly faster with the MTL).

Also enclosed is the output of ompi_info --all run on the cluster node (cn18) 
and on the workstation (linux15).

Please help me. I think my problem is only a matter of wrong MCA parameters 
(which remain obscure to me).

Thanks,

Martin Audet, Research Officer
Industrial Material Institute
National Research Council of Canada
75 de Mortagne, Boucherville, QC, J4B 6Y4, Canada
