On Feb 13, 2012, at 1:28 PM, Jeff Squyres wrote:

> You might want to fully uninstall the distro-installed version of Open MPI
> on all the nodes (e.g., Red Hat may have installed a different version of
> Open MPI, and that version is being found in your $PATH before your
> custom-installed version).
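Jeff's check can be scripted before uninstalling anything; a minimal sketch, assuming an RPM-based distro such as Red Hat (substitute your package manager's query if not):

```shell
#!/bin/sh
# Sketch of Jeff's check: look for a distro-packaged Open MPI, then see
# which mpirun the current PATH resolves first.  The rpm query is an
# RPM-based-distro assumption; on other systems use dpkg -l, etc.
rpm -qa 2>/dev/null | grep -i openmpi || echo "no rpm-packaged openmpi found"
command -v mpirun || echo "no mpirun on PATH"
```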
Besides Jeff's suggestion, also prepend [rather than append] the new
OpenMPI to your PATH and LD_LIBRARY_PATH [bash style here]:

export PATH=/my/openmpi/1.4.4/bin:$PATH
export LD_LIBRARY_PATH=/my/openmpi/1.4.4/lib:$LD_LIBRARY_PATH

Multiple installed flavors, releases, and versions of MPI are often a
source of confusion.

I hope this helps,
Gus Correa

> On Feb 13, 2012, at 12:12 PM, Richard Bardwell wrote:
>
>> OK, 1.4.4 is happily installed on both machines. But, I now get a really
>> weird error when running on the 2 nodes. I get
>>
>>   Error: unknown option "--daemonize"
>>
>> even though I am just running with -np 2 -hostfile test.hst
>>
>> The program runs fine on 2 cores if running locally on each node.
>>
>> Any ideas ??
>>
>> Thanks
>> Richard
>>
>> ----- Original Message -----
>> From: "Gustavo Correa" <g...@ldeo.columbia.edu>
>> To: "Open MPI Users" <us...@open-mpi.org>
>> Sent: Monday, February 13, 2012 4:22 PM
>> Subject: Re: [OMPI users] MPI orte_init fails on remote nodes
>>
>>> On Feb 13, 2012, at 11:02 AM, Richard Bardwell wrote:
>>>
>>>> Ralph
>>>> I had done a make clean in the 1.2.8 directory, if that is what you meant?
>>>> Or do I need to do something else?
>>>> I appreciate your help on this, by the way ;-)
>>>
>>> Hi Richard
>>>
>>> You can install in a different directory, totally separate from 1.2.8.
>>> Create a new work directory [which is not the final installation
>>> directory, just work, say /tmp/openmpi/1.4.4/work].
>>> Launch the OpenMPI 1.4.4 configure script from this new work directory
>>> with the --prefix pointing to your desired installation directory
>>> [e.g. /home/richard/openmpi/1.4.4/].
>>> I am assuming this is NFS mounted on the nodes [if you have a cluster].
>>> [Check all options with 'configure --help'.]
>>> Then do make, make install.
>>> Finally set your PATH and LD_LIBRARY_PATH to point to the new
>>> installation directory, to prevent mixing with the old 1.2.8.
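The prepend-versus-append point can be seen with two throwaway stub scripts; a minimal sketch, assuming a POSIX shell, with /tmp/new and /tmp/old as hypothetical stand-ins for the 1.4.4 and 1.2.8 bin directories:

```shell
#!/bin/sh
# Why prepend rather than append: the shell searches $PATH left to
# right, so whichever bin directory comes first wins.  /tmp/new and
# /tmp/old are throwaway stand-ins, not real Open MPI installs.
mkdir -p /tmp/new /tmp/old
printf '#!/bin/sh\necho 1.4.4\n' > /tmp/new/mpirun
printf '#!/bin/sh\necho 1.2.8\n' > /tmp/old/mpirun
chmod +x /tmp/new/mpirun /tmp/old/mpirun

PATH=/tmp/old:$PATH       # the stale install is already on PATH
PATH=/tmp/new:$PATH       # prepend the new one, as suggested above
export PATH

mpirun                    # prints: 1.4.4
command -v mpirun         # prints: /tmp/new/mpirun
```

The same ordering rule applies to LD_LIBRARY_PATH: the dynamic linker searches its directories left to right, so a stale libmpi in an earlier directory wins.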
>>> I have a number of OpenMPI versions here, compiled with various
>>> compilers, and they coexist well this way.
>>>
>>> I hope this helps,
>>> Gus Correa
>>>
>>>> ----- Original Message -----
>>>> From: Ralph Castain
>>>> To: Open MPI Users
>>>> Sent: Monday, February 13, 2012 3:41 PM
>>>> Subject: Re: [OMPI users] MPI orte_init fails on remote nodes
>>>>
>>>> You need to clean out the old attempt - that is a stale file
>>>>
>>>> Sent from my iPad
>>>>
>>>> On Feb 13, 2012, at 7:36 AM, "Richard Bardwell" <rich...@sharc.co.uk> wrote:
>>>>
>>>>> OK, I installed 1.4.4, rebuilt the exec and guess what ...... I now
>>>>> get some weird errors as below:
>>>>>
>>>>>   mca: base: component_find: unable to open
>>>>>   /usr/local/lib/openmpi/mca_ras_dash_host
>>>>>
>>>>> along with a few other files, even though the .so / .la files are
>>>>> all there!
>>>>>
>>>>> ----- Original Message -----
>>>>> From: Ralph Castain
>>>>> To: Open MPI Users
>>>>> Sent: Monday, February 13, 2012 2:59 PM
>>>>> Subject: Re: [OMPI users] MPI orte_init fails on remote nodes
>>>>>
>>>>> Good heavens - where did you find something that old? Can you use a
>>>>> more recent version?
>>>>>
>>>>> Sent from my iPad
>>>>>
>>>>>> Gentlemen
>>>>>>
>>>>>> I am struggling to get MPI working when the hostfile contains
>>>>>> different nodes. I get the error below. Any ideas ?? I can ssh
>>>>>> without password between the two nodes. I am running 1.2.8 MPI on
>>>>>> both machines.
>>>>>> Any help most appreciated !!!!!
>>>>>>
>>>>>> MPITEST/v8_mpi_test> mpiexec -n 2 --debug-daemons -hostfile test.hst /home/sharc/MPITEST/v8_mpi_test/mpitest
>>>>>> Daemon [0,0,1] checking in as pid 10490 on host 192.0.2.67
>>>>>> [linux-z0je:08804] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 182
>>>>>> --------------------------------------------------------------------------
>>>>>> It looks like orte_init failed for some reason; your parallel process is
>>>>>> likely to abort.
>>>>>> There are many reasons that a parallel process can
>>>>>> fail during orte_init; some of which are due to configuration or
>>>>>> environment problems.  This failure appears to be an internal failure;
>>>>>> here's some additional information (which may only be relevant to an
>>>>>> Open MPI developer):
>>>>>>
>>>>>>   orte_rml_base_select failed
>>>>>>   --> Returned value -13 instead of ORTE_SUCCESS
>>>>>> --------------------------------------------------------------------------
>>>>>> [linux-z0je:08804] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
>>>>>> [linux-z0je:08804] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
>>>>>> Open RTE was unable to initialize properly. The error occured while
>>>>>> attempting to orte_init(). Returned value -13 instead of ORTE_SUCCESS.
>>>>>> [linux-tmpw:10490] [0,0,1] orted_recv_pls: received message from [0,0,0]
>>>>>> [linux-tmpw:10490] [0,0,1] orted_recv_pls: received kill_local_procs
>>>>>> [linux-tmpw:10489] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>>>>> [linux-tmpw:10489] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1158
>>>>>> [linux-tmpw:10489] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>>>>>> [linux-tmpw:10489] ERROR: A daemon on node 192.0.2.68 failed to start as expected.
>>>>>> [linux-tmpw:10489] ERROR: There may be more information available from
>>>>>> [linux-tmpw:10489] ERROR: the remote shell (see above).
>>>>>> [linux-tmpw:10489] ERROR: The daemon exited unexpectedly with status 243.
>>>>>> [linux-tmpw:10490] [0,0,1] orted_recv_pls: received message from [0,0,0]
>>>>>> [linux-tmpw:10490] [0,0,1] orted_recv_pls: received exit
>>>>>> [linux-tmpw:10489] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>>>>>> [linux-tmpw:10489] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1190
>>>>>> --------------------------------------------------------------------------
>>>>>> mpiexec was unable to cleanly terminate the daemons for this job.
>>>>>> Returned value Timeout instead of ORTE_SUCCESS.
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
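Gus's out-of-tree build recipe, gathered into one sequence [the prefix and work-directory paths are the examples used in the thread, and the source-tree location is a placeholder, not a real path]:

```shell
# Out-of-tree build: configure from a scratch work directory, install
# to an NFS-shared prefix that every node mounts at the same path.
mkdir -p /tmp/openmpi/1.4.4/work
cd /tmp/openmpi/1.4.4/work
/path/to/openmpi-1.4.4/configure --prefix=/home/richard/openmpi/1.4.4
make
make install

# Then, on every node [bash style], put the new install first:
export PATH=/home/richard/openmpi/1.4.4/bin:$PATH
export LD_LIBRARY_PATH=/home/richard/openmpi/1.4.4/lib:$LD_LIBRARY_PATH
```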