Hi,

> On 13.04.2017 at 11:00, Vincent Drach <vincent.dr...@plymouth.ac.uk> wrote:
> 
> 
> Dear mailing list,
> 
> We are experiencing run-time failures on a small cluster with openmpi-2.0.2 
> and gcc 6.3 and gcc 5.4.
> The job starts normally and performs a lot of communication. After 5-10 
> minutes the connection to the hosts is closed and
> the following error message is reported:
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> 
> 
> 
> The issue does not seem to be due to the InfiniBand configuration, because 
> the job also crashes when using the TCP protocol.
> 
> Do you have any clue as to what the issue could be?

Is it a single MPI job, or does the application issue many `mpiexec` calls 
during its runtime?

Is there any limit on how often `ssh` may connect to a node within a given 
time frame? Do you use a queuing system?
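
One rough way to check for such an `ssh` limit is to open many short 
connections in a row and see whether some of them start to fail. The sketch 
below is only an illustration: the function name `probe_ssh`, the attempt 
count and the delay are my own choices, and it assumes passwordless `ssh` 
to the node is already set up.

```python
import subprocess
import sys
import time


def probe_ssh(host, attempts=20, delay=0.5, run=subprocess.run):
    """Open `attempts` short ssh sessions to `host`.

    Returns the list of attempt indices whose ssh call exited non-zero.
    `run` is injectable so the loop can be exercised without a real host.
    """
    failures = []
    for i in range(attempts):
        # BatchMode=yes makes ssh fail instead of prompting for a password;
        # ConnectTimeout bounds how long a refused or dropped attempt hangs.
        result = run(
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures.append(i)
        time.sleep(delay)
    return failures


if __name__ == "__main__":
    # Host names are taken from the command line, e.g.: python probe.py node01 node02
    for host in sys.argv[1:]:
        failed = probe_ssh(host)
        print(f"{host}: {len(failed)} failed attempt(s) at indices {failed}")
```

If failures only show up after a burst of connections, a connection limit 
on the node or a firewall in between would be a plausible suspect; if every 
attempt succeeds, the cause is probably elsewhere.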

-- Reuti
