Re: [OMPI users] TCP usage in MPI singletons
Hi Gilles, all,

Using `OMPI_MCA_ess_singleton_isolated=true ./program` achieves the desired result of establishing no TCP connections for a singleton execution. Thank you for the suggestion!

Best regards,
-Dan

On Wed, Apr 17, 2019 at 5:35 PM Gilles Gouaillardet wrote:
> Daniel,
>
> If your MPI singleton will never MPI_Comm_spawn(), then you can use the
> isolated mode like this:
>
>     OMPI_MCA_ess_singleton_isolated=true ./program
>
> You can also save some ports by blacklisting the btl/tcp component:
>
>     OMPI_MCA_ess_singleton_isolated=true OMPI_MCA_pml=ob1 \
>         OMPI_MCA_btl=vader,self ./program
>
> Cheers,
>
> Gilles
>
> On 4/18/2019 3:51 AM, Daniel Hemberger wrote:
> > Hi everyone,
> >
> > I've been trying to track down the source of TCP connections when
> > running MPI singletons, with the goal of avoiding all TCP
> > communication to free up ports for other processes. I have a local apt
> > install of OpenMPI 2.1.1 on Ubuntu 18.04 which does not establish any
> > TCP connections by default, either when run as "mpirun -np 1 ./program"
> > or "./program". But it has non-TCP alternatives for both the BTL
> > (vader, self, etc.) and OOB (ud and usock) frameworks, so I was not
> > surprised by this result.
> >
> > On a remote machine, I'm running the same test with an assortment of
> > OpenMPI versions (1.6.4, 1.8.6, 4.0.0, and 4.0.1 on RHEL6, and 1.10.7
> > on RHEL7). In all but 1.8.6 and 1.10.7, a TCP connection is always
> > established, even if I disable the TCP BTL on the command line (e.g.
> > "mpirun --mca btl ^tcp"). I therefore assumed this was because `tcp`
> > was the only OOB interface available in these installations. This TCP
> > connection is established both for "mpirun -np 1 ./program" and
> > "./program".
> >
> > The confusing part is that the 1.8.6 and 1.10.7 installations only
> > appear to establish a TCP connection when invoked as "mpirun -np 1
> > ./program", but _not_ as "./program", even though their only OOB
> > interface was also `tcp`. This result was not consistent with my
> > understanding, so now I am confused about when I should expect TCP
> > communication to occur.
> >
> > Is there a known explanation for what I am seeing? Is there actually a
> > way to get singletons to forgo all TCP communication, even if TCP is
> > the only OOB available, or is there something else at play here? I'd
> > be happy to provide any config.log files or ompi_info output if it
> > would help.
> >
> > For more context, the underlying issue I'm trying to resolve is that
> > we are (unfortunately) running many short instances of mpirun, and the
> > TCP connections are piling up in the TIME_WAIT state because they
> > aren't cleaned up faster than we create them.
> >
> > Any advice or pointers would be greatly appreciated!
> >
> > Thanks,
> > -Dan
> >
> > ___
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users
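[Editor's note: a minimal sketch combining Gilles' two suggestions with a quick check of the TIME_WAIT pile-up Dan describes. The `./program` invocations assume an Open MPI install and a compiled MPI binary, so they are shown as comments only; the TIME_WAIT counter assumes a Linux host with `/proc/net/tcp`.]

```shell
# Suggested invocations from the thread (require Open MPI and a compiled
# MPI binary named ./program, so left commented out here):
#
#   OMPI_MCA_ess_singleton_isolated=true ./program
#   OMPI_MCA_ess_singleton_isolated=true OMPI_MCA_pml=ob1 \
#       OMPI_MCA_btl=vader,self ./program

# Count sockets currently stuck in TIME_WAIT by reading /proc/net/tcp
# directly (Linux only; no extra tools needed). The state column is
# field 4, and TCP_TIME_WAIT is state code 06.
count_time_wait() {
    cat /proc/net/tcp /proc/net/tcp6 2>/dev/null | awk '$4 == "06"' | wc -l
}

# Print the current count; run before and after a batch of short mpirun
# jobs to see whether connections are accumulating.
count_time_wait
```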
Re: [OMPI users] MPI_INIT failed 4.0.1
Thanks for the hint.

Regards,
Mahmood

On Thu, Apr 18, 2019 at 2:47 AM Reuti wrote:
> Hi,
>
> Am 17.04.2019 um 11:07 schrieb Mahmood Naderan:
>
> > Hi,
> > After a successful installation of v4 in a custom location, I see some
> > errors that the default installation (v2) does not produce.
>
> Did you also recompile your application with this version of Open MPI?
>
> -- Reuti
>
> > $ /share/apps/softwares/openmpi-4.0.1/bin/mpirun --version
> > mpirun (Open MPI) 4.0.1
> >
> > Report bugs to http://www.open-mpi.org/community/help/
> >
> > $ /share/apps/softwares/openmpi-4.0.1/bin/mpirun -np 4 pw.x -i mos2.rlx.in
> > --------------------------------------------------------------------------
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort. There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or environment
> > problems. This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):
> >
> >   ompi_mpi_init: ompi_rte_init failed
> >   --> Returned "(null)" (-43) instead of "Success" (0)
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort. There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or environment
> > problems.
> > This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):
> >
> >   ompi_mpi_init: ompi_rte_init failed
> >   --> Returned "(null)" (-43) instead of "Success" (0)
> > --------------------------------------------------------------------------
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > ***    and potentially your MPI job)
> > [rocks7.jupiterclusterscu.com:18531] Local abort before MPI_INIT completed
> > completed successfully, but am not able to aggregate error messages, and
> > not able to guarantee that all other processes were killed!
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > ***    and potentially your MPI job)
> > [rocks7.jupiterclusterscu.com:18532] Local abort before MPI_INIT completed
> > completed successfully, but am not able to aggregate error messages, and
> > not able to guarantee that all other processes were killed!
> > --------------------------------------------------------------------------
> > Primary job terminated normally, but 1 process returned
> > a non-zero exit code. Per user-direction, the job has been aborted.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort. There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or environment
> > problems.
> > This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):
> >
> >   ompi_mpi_init: ompi_rte_init failed
> >   --> Returned "(null)" (-43) instead of "Success" (0)
> > --------------------------------------------------------------------------
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > ***    and potentially your MPI job)
> > [rocks7.jupiterclusterscu.com:18530] Local abort before MPI_INIT completed
> > completed successfully, but am not able to aggregate error messages, and
> > not able to guarantee that all other processes were killed!
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > ***    and potentially your MPI job)
> > [rocks7.jupiterclusterscu.com:18533] Local abort before
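[Editor's note: Reuti's question points at a library mismatch. A binary built against one Open MPI version but launched with another's mpirun can fail in ompi_rte_init much like this. A minimal sketch of a sanity check, assuming a Linux host with `ldd`; `check_mpi_link` is just an illustrative helper name, and the example path to pw.x is hypothetical.]

```shell
# Print the MPI shared libraries a binary is actually linked against,
# so a mix of Open MPI versions can be spotted before launching.
check_mpi_link() {
    ldd "$1" 2>/dev/null | grep -i 'libmpi'
}

# Example (hypothetical path; substitute your own executable):
#   check_mpi_link /path/to/pw.x
```

If the reported libmpi path does not live under the Open MPI prefix whose mpirun you are invoking, recompile the application against that installation.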