Re: [OMPI users] TCP usage in MPI singletons

2019-04-19 Thread Daniel Hemberger
Hi Gilles, all,

Using `OMPI_MCA_ess_singleton_isolated=true ./program` achieves the desired
result of establishing no TCP connections for a singleton execution.
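
For completeness, a rough way to confirm this (assuming `ss` from iproute2 is
available and that ./program stays alive long enough to inspect):

    OMPI_MCA_ess_singleton_isolated=true ./program &
    PID=$!
    ss -tnp | grep "pid=$PID"    # prints nothing when the process owns no TCP sockets
    wait $PID

On systems without ss, `lsof -a -i TCP -p $PID` shows the same information.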

Thank you for the suggestion!

Best regards,
-Dan

On Wed, Apr 17, 2019 at 5:35 PM Gilles Gouaillardet wrote:

> Daniel,
>
>
> If your MPI singleton will never call MPI_Comm_spawn(), then you can use
> the isolated mode like this:
>
> OMPI_MCA_ess_singleton_isolated=true ./program
>
>
> You can also save some ports by blacklisting the btl/tcp component
>
>
> OMPI_MCA_ess_singleton_isolated=true OMPI_MCA_pml=ob1 OMPI_MCA_btl=vader,self ./program
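>
> (The same parameters can also go in a per-user MCA params file so they
> apply to every run without setting environment variables; a minimal
> sketch, using the default per-user location:
>
>     $ cat ~/.openmpi/mca-params.conf
>     ess_singleton_isolated = true
>     pml = ob1
>     btl = vader,self
>
> Environment variables and mpirun --mca options still take precedence over
> this file.)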
>
>
> Cheers,
>
>
> Gilles
>
> On 4/18/2019 3:51 AM, Daniel Hemberger wrote:
> > Hi everyone,
> >
> > I've been trying to track down the source of TCP connections when
> > running MPI singletons, with the goal of avoiding all TCP
> > communication to free up ports for other processes. I have a local apt
> > install of OpenMPI 2.1.1 on Ubuntu 18.04 which does not establish any
> > TCP connections by default, whether run as "mpirun -np 1 ./program" or
> > as "./program". But it has non-TCP alternatives for both
> > the BTL (vader, self, etc.) and OOB (ud and usock) frameworks, so I
> > was not surprised by this result.
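> >
> > (In case it helps others reproduce this: a rough way to list which BTL
> > and OOB components a given install provides, assuming the matching
> > ompi_info is first in PATH, is
> >
> >     ompi_info | grep -E "MCA (btl|oob):"
> >
> > which prints one line per available component in each framework.)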
> >
> > On a remote machine, I'm running the same test with an assortment of
> > OpenMPI versions (1.6.4, 1.8.6, 4.0.0, 4.0.1 on RHEL6 and 1.10.7 on
> > RHEL7). In all but 1.8.6 and 1.10.7, there is always a TCP connection
> > established, even if I disable the TCP BTL on the command line (e.g.
> > "mpirun --mca btl ^tcp"). Therefore, I assumed this was because `tcp`
> > was the only OOB interface available in these installations. This TCP
> > connection is established both for "mpirun -np 1 ./program" and
> > "./program".
> >
> > The confusing part is that the 1.8.6 and 1.10.7 installations only
> > appear to establish a TCP connection when invoked with "mpirun -np 1
> > ./program", but _not_ with "./program", even though its only OOB
> > interface was also `tcp`. This result was not consistent with my
> > understanding, so now I am confused about when I should expect TCP
> > communication to occur.
> >
> > Is there a known explanation for what I am seeing? Is there actually a
> > way to get singletons to forgo all TCP communication, even if TCP is
> > the only OOB available, or is there something else at play here? I'd
> > be happy to provide any config.log files or ompi_info output if it
> > would help.
> >
> > For more context, the underlying issue I'm trying to resolve is that
> > we are (unfortunately) running many short instances of mpirun, and the
> > TCP connections are piling up in the TIME_WAIT state because they
> > aren't cleaned up as fast as we create them.
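> >
> > (A rough way to watch that pile-up, before and after a batch of mpirun
> > invocations, assuming a reasonably recent ss; `netstat -ant | grep -c
> > TIME_WAIT` is an older equivalent:
> >
> >     ss -nt state time-wait | wc -l    # includes one header line
> >
> > The difference between the two counts is roughly how many connections
> > each batch leaves behind.)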
> >
> > Any advice or pointers would be greatly appreciated!
> >
> > Thanks,
> > -Dan
> >
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] MPI_INIT failed 4.0.1

2019-04-19 Thread Mahmood Naderan
Thanks for the hint.
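
For the record, a rough way to check which Open MPI a binary was actually
built against (assuming pw.x here is the same executable passed to mpirun
and is on PATH) is to look at the libmpi it resolves to:

    ldd $(which pw.x) | grep -i libmpi

If that still points at the old v2 install rather than the new
/share/apps/softwares/openmpi-4.0.1 tree, the application likely needs to
be rebuilt with the 4.0.1 compiler wrappers (mpif90/mpicc) before running
it under the 4.0.1 mpirun.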

Regards,
Mahmood




On Thu, Apr 18, 2019 at 2:47 AM Reuti wrote:

> Hi,
>
> On 17.04.2019 at 11:07, Mahmood Naderan wrote:
>
> > Hi,
> > After successful installation of v4 in a custom location, I see some
> > errors that the default installation (v2) does not produce.
>
> Did you also recompile your application with this version of Open MPI?
>
> -- Reuti
>
>
> > $ /share/apps/softwares/openmpi-4.0.1/bin/mpirun --version
> > mpirun (Open MPI) 4.0.1
> >
> > Report bugs to http://www.open-mpi.org/community/help/
> > $ /share/apps/softwares/openmpi-4.0.1/bin/mpirun -np 4 pw.x -i mos2.rlx.in
> > --------------------------------------------------------------------------
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or environment
> > problems.  This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):
> >
> >   ompi_mpi_init: ompi_rte_init failed
> >   --> Returned "(null)" (-43) instead of "Success" (0)
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or environment
> > problems.  This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):
> >
> >   ompi_mpi_init: ompi_rte_init failed
> >   --> Returned "(null)" (-43) instead of "Success" (0)
> > --------------------------------------------------------------------------
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > ***    and potentially your MPI job)
> > [rocks7.jupiterclusterscu.com:18531] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > ***    and potentially your MPI job)
> > [rocks7.jupiterclusterscu.com:18532] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> > --------------------------------------------------------------------------
> > Primary job  terminated normally, but 1 process returned
> > a non-zero exit code. Per user-direction, the job has been aborted.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or environment
> > problems.  This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):
> >
> >   ompi_mpi_init: ompi_rte_init failed
> >   --> Returned "(null)" (-43) instead of "Success" (0)
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or environment
> > problems.  This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):
> >
> >   ompi_mpi_init: ompi_rte_init failed
> >   --> Returned "(null)" (-43) instead of "Success" (0)
> > --------------------------------------------------------------------------
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > ***    and potentially your MPI job)
> > [rocks7.jupiterclusterscu.com:18530] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > ***    and potentially your MPI job)
> > [rocks7.jupiterclusterscu.com:18533] Local abort before