Re: [OMPI users] Unable to connect to a server using MX MTL with TCP
Thanks to both Scott and Jeff! Next time I have a problem, I will check the README file first (Doh!). Also, we might mitigate the problem by connecting the workstation to the Myrinet switch.

Martin

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres
Sent: June 9, 2010 15:34
To: Open MPI Users
Subject: Re: [OMPI users] Unable to connect to a server using MX MTL with TCP
Re: [OMPI users] Unable to connect to a server using MX MTL with TCP
On Jun 5, 2010, at 7:52 AM, Scott Atchley wrote:

> I do not think this is a supported scenario. George or Jeff can correct me,
> but when you use the MX MTL you are using the pml cm and not the pml ob1.
> The BTLs are part of ob1. When using the MX MTL, it cannot use the TCP BTL.
>
> Your only solution would be to use the MX BTL.

Sorry for the delayed reply.

Scott is correct; the MX MTL uses the "cm" PML. The "cm" PML can only use *one* MTL at a time (little-known fact of Open MPI lore: "cm" stands for several things, one of which is "Connor MacLeod" -- there can only be one).

Here's a chunk of text from the README:

- There are three MPI network models available: "ob1", "csum", and "cm".
  "ob1" and "csum" use BTL ("Byte Transfer Layer") components for each
  supported network. "cm" uses MTL ("Matching Transport Layer") components
  for each supported network.

- "ob1" supports a variety of networks that can be used in combination with
  each other (per OS constraints; e.g., there are reports that the GM and
  OpenFabrics kernel drivers do not operate well together):
  - OpenFabrics: InfiniBand and iWARP
  - Loopback (send-to-self)
  - Myrinet: GM and MX (including Open-MX)
  - Portals
  - Quadrics Elan
  - Shared memory
  - TCP
  - SCTP
  - uDAPL

- "csum" is exactly the same as "ob1", except that it performs additional
  data integrity checks to ensure that the received data is intact (vs.
  trusting the underlying network to deliver the data correctly). csum
  supports all the same networks as ob1, but there is a performance penalty
  for the additional integrity checks.

- "cm" supports a smaller number of networks (and they cannot be used
  together), but may provide better overall MPI performance:
  - Myrinet MX (including Open-MX, but not GM)
  - InfiniPath PSM
  - Portals

Open MPI will, by default, choose to use "cm" when the InfiniPath PSM MTL can be used. Otherwise, "ob1" will be used and the corresponding BTLs will be selected. "csum" will never be selected by default. Users can force the use of ob1 or cm if desired by setting the "pml" MCA parameter at run-time:

  shell$ mpirun --mca pml ob1 ...
or
  shell$ mpirun --mca pml csum ...
or
  shell$ mpirun --mca pml cm ...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
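As a sanity check, it should be possible to see which PML Open MPI actually selected by raising the PML framework's verbosity level (a sketch; the exact messages printed vary by release, so treat the output format as an assumption rather than a guarantee):

  shell$ mpiexec --mca pml_base_verbose 100 -x MX_RCACHE=2 \
      -machinefile machinefile_cn18 --mca mtl mx --mca pml cm -n 1 ./simpleserver

The component-selection messages on stderr should show whether "cm" (with the MX MTL) or "ob1" (with BTLs) was chosen, which makes it easy to confirm that a given set of MCA parameters did what you intended.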
Re: [OMPI users] Unable to connect to a server using MX MTL with TCP
On Jun 4, 2010, at 7:18 PM, Audet, Martin wrote:

> Hi OpenMPI_Users and OpenMPI_Developers,
>
> I'm unable to connect a client application using MPI_Comm_connect() to a
> server job (the server job calls MPI_Open_port() before calling
> MPI_Comm_accept()) when the server job uses the MX MTL (although it works
> without problems when the server uses the MX BTL). The server job runs on a
> cluster connected to a Myrinet 10G network (MX 1.2.11) in addition to an
> ordinary Ethernet network. The client runs on a different machine, not
> connected to the Myrinet network but accessible via the Ethernet network.

Hi Martin,

I do not think this is a supported scenario. George or Jeff can correct me, but when you use the MX MTL you are using the pml cm and not the pml ob1. The BTLs are part of ob1. When using the MX MTL, it cannot use the TCP BTL.

Your only solution would be to use the MX BTL.

Scott
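Concretely, using the MX BTL means running under the ob1 PML with MX in the BTL list. A command along these lines (a sketch assembled from the options Martin mentions elsewhere in this thread; ob1 is the default PML, so naming it explicitly is optional):

  shell$ mpiexec -machinefile machinefile_cn18 --mca pml ob1 \
      --mca btl mx,tcp,sm,self -n 1 ./simpleserver

With ob1, the server can talk MX to the cluster nodes and TCP to the off-fabric workstation at the same time, which is exactly what the cm/MTL combination cannot do.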
[OMPI users] Unable to connect to a server using MX MTL with TCP
Hi OpenMPI_Users and OpenMPI_Developers,

I'm unable to connect a client application using MPI_Comm_connect() to a server job (the server job calls MPI_Open_port() before calling MPI_Comm_accept()) when the server job uses the MX MTL (although it works without problems when the server uses the MX BTL). The server job runs on a cluster connected to a Myrinet 10G network (MX 1.2.11) in addition to an ordinary Ethernet network. The client runs on a different machine, not connected to the Myrinet network but accessible via the Ethernet network.

Attached to this message are the simple server and client programs (87 lines total), called simpleserver.c and simpleclient.c. Note we are using Open MPI 1.4.2 on x86_64 Linux (server: Fedora 7, client: Fedora 12).

Compiling these programs with mpicc on the server front node (fn1) and the client workstation (linux15) works well:

  [audet@fn1 bench]$ mpicc simpleserver.c -o simpleserver
  [audet@linux15 mpi]$ mpicc simpleclient.c -o simpleclient

We then start the server on the cluster (the job is started on cluster node cn18), asking to use the MTL:

  [audet@fn1 bench]$ mpiexec -x MX_RCACHE=2 -machinefile machinefile_cn18 --mca mtl mx --mca pml cm -n 1 ./simpleserver

It prints the server port (note we use MX_RCACHE=2 to avoid a warning; it doesn't affect the current issue):

  Server port = '3548905472.0;tcp://172.17.15.20:39517+3548905473.0;tcp://172.17.10.18:47427:300'

We then start the client on the workstation with this port number:

  [audet@linux15 mpi]$ mpiexec -n 1 ./simpleclient '3548905472.0;tcp://172.17.15.20:39517+3548905473.0;tcp://172.17.10.18:47427:300'

The server process core dumps as follows:

  MPI_Comm_accept() sucessful...
  [cn18:24582] *** Process received signal ***
  [cn18:24582] Signal: Segmentation fault (11)
  [cn18:24582] Signal code: Address not mapped (1)
  [cn18:24582] Failing at address: 0x38
  [cn18:24582] [ 0] /lib64/libpthread.so.0 [0x305de0dd20]
  [cn18:24582] [ 1] /usr/local/openmpi-1.4.2/lib/openmpi/mca_mtl_mx.so [0x2d6a7e6d]
  [cn18:24582] [ 2] /usr/local/openmpi-1.4.2/lib/openmpi/mca_pml_cm.so [0x2d4a319d]
  [cn18:24582] [ 3] /usr/local/openmpi/lib/libmpi.so.0(ompi_dpm_base_disconnect_init+0xbf) [0x2ab1403f]
  [cn18:24582] [ 4] /usr/local/openmpi-1.4.2/lib/openmpi/mca_dpm_orte.so [0x2ed0eb19]
  [cn18:24582] [ 5] /usr/local/openmpi/lib/libmpi.so.0(PMPI_Comm_disconnect+0xa0) [0x2aaf4f20]
  [cn18:24582] [ 6] ./simpleserver(main+0x14c) [0x400d04]
  [cn18:24582] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x305ce1daa4]
  [cn18:24582] [ 8] ./simpleserver [0x400b09]
  [cn18:24582] *** End of error message ***
  --------------------------------------------------------------------------
  mpiexec noticed that process rank 0 with PID 24582 on node cn18 exited on signal 11 (Segmentation fault).
  --------------------------------------------------------------------------
  [audet@fn1 bench]$

And the client stops with the following error message:

  --------------------------------------------------------------------------
  At least one pair of MPI processes are unable to reach each other for MPI
  communications. This means that no Open MPI device has indicated that it
  can be used to communicate between these processes. This is an error;
  Open MPI requires that all MPI processes be able to reach each other. This
  error can sometimes be the result of forgetting to specify the "self" BTL.

    Process 1 ([[31386,1],0]) is on host: linux15
    Process 2 ([[54152,1],0]) is on host: cn18
    BTLs attempted: self sm tcp

  Your MPI job is now going to abort; sorry.
  --------------------------------------------------------------------------
  MPI_Comm_connect() sucessful...
  Error in comm_disconnect_waitall
  [audet@linux15 mpi]$

I really don't understand this message, because the client can connect to the server using TCP over Ethernet.
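The attached simpleserver.c and simpleclient.c are not reproduced in the archive; a minimal sketch of the accept/connect pattern described above might look like the following (a hypothetical reconstruction, not Martin's actual 87 lines; error checking omitted for brevity):

/* simpleserver.c -- sketch: open a port, accept one client, disconnect.
 * The stack trace above points at PMPI_Comm_disconnect on the server side. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Comm client;

    MPI_Init(&argc, &argv);
    MPI_Open_port(MPI_INFO_NULL, port_name);   /* obtain the port string */
    printf("Server port = '%s'\n", port_name); /* paste this to the client */
    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
    printf("MPI_Comm_accept() successful...\n");
    MPI_Comm_disconnect(&client);              /* crash site per the trace */
    MPI_Close_port(port_name);
    MPI_Finalize();
    return 0;
}

/* simpleclient.c -- sketch: connect to the port string given on the
 * command line, then disconnect. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm server;

    MPI_Init(&argc, &argv);
    MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
    printf("MPI_Comm_connect() successful...\n");
    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}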
Moreover, if I add MCA options when I start the server to include the TCP BTL, the same problem happens (the argument list then becomes '--mca mtl mx --mca pml cm --mca btl tcp,shared,self'). However, if I remove all MCA options when I start the server (i.e., when the MX BTL is used), no such problem appears. Everything also goes fine if I start the server with an explicit request to use the MX and TCP BTLs (e.g., with the options '--mca btl mx,tcp,sm,self').

For running our server application we really prefer to use the MX MTL over the MX BTL, since the application is much faster with the MTL (although the usual ping-pong test is only slightly faster with the MTL).

Also enclosed is the output of ompi_info --all, run on the cluster node (cn18) and the workstation (linux15).

Please help me. I think my problem is only a question of wrong MCA parameters (which are obscure to me).

Thanks,

Martin Audet, Research Officer
Industrial Material Institute
National Research Council of Canada
75 de Mortagne, Boucherville, QC, J4B 6Y4, Canada