Re: [OMPI users] Accessing OpenMPI processes over Internet using ssh
On Nov 24, 2011, at 2:00 AM, Reuti wrote:

> Hi,
>
> On 24.11.2011 at 05:26, Jaison Paul wrote:
>
>> I am trying to access OpenMPI processes over Internet using ssh and not
>> quite successful, yet. I believe that I should be able to do it.
>>
>> I have to run one process on my PC and the rest on a remote cluster over
>> internet. I have set the public keys (at .ssh/authorized_keys) to access
>> remote nodes without a password.
>>
>> I use hostfile to run mpi. It will read something like:
>> -
>> localhost
>> u...@remotehost.com
>
> this is not a valid syntax for Open MPI.

This isn't correct - we have long supported that syntax in a hostfile, and there is no issue with having a different user name at each node.

Jaison: are you sure your nodes are set up for password-less ssh? In other words, have you set up your .ssh files on the remote nodes so they will allow us to ssh a process on them without providing a password? This is the typical problem we see.

>
>
>> -
>> But it fails.
>>
>> The issue seems to be the user! That is, the user on my PC is different to
>> that of user at remotehosts. That's my assumption.
>>
>> Is this the problem? Is there any work-around to solve this issue? Do I need
>> to have same username at all nodes to solve this issue?
>
> You can define nicknames for an ssh connection in a file ~/.ssh/config like:
>
> Host foobar
>    User baz
>    Hostname the.remote.server.demo
>    Port 1234
>
> While this will work with any nickname for an ssh connection, in your case
> the nickname must match the one specified in the hostfile, as Open MPI won't
> use this lookup file:
>
> Host remotehost.com
>    User user
>
> ssh should then use the entries therein to initiate the connection. For
> details you can have a look at `man ssh_config`.
>
> -- Reuti
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
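The password-less ssh question raised in this reply can be checked non-interactively before involving mpirun at all. The following is only a sketch: the helper name `check_passwordless_ssh` and the `SSH_CMD` override are assumptions made for illustration, and `user@remotehost.com` is a placeholder for the obfuscated hostfile entry.

```shell
# Hypothetical helper: succeeds only if ssh can log in without prompting.
# SSH_CMD is an assumption added so the check can be dry-run with a stub.
check_passwordless_ssh() {
    # BatchMode=yes makes ssh fail instead of asking for a password.
    "${SSH_CMD:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Example call (placeholder target):
#   check_passwordless_ssh user@remotehost.com && echo "password-less ssh OK"
```

If the call fails for a node, mpirun launched over ssh would most likely fail for that node in the same way, so it is worth running the check from the machine that starts the job.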
Re: [OMPI users] open-mpi error
Hi Markus

You have some major problems with confused installations of MPIs. First, you cannot compile an application against MPICH and expect to run it with OMPI - the two are not binary compatible. You need to compile against the MPI installation you intend to run against.

Second, your errors appear to be because you are not pointing your library path at the OMPI installation, and so the libraries are not being found. You need to set LD_LIBRARY_PATH to include the path to where you installed OMPI. Based on the configure line you give, that would mean ensuring that /opt/mpirun/lib was in that envar. Likewise, /opt/mpirun/bin needs to be in your PATH.

Once you have those correctly set, and build your app against the appropriate mpicc, you should be able to run.

BTW: your last message indicates that you built against an old LAM MPI, so you appear to have some pretty old software laying around. Perhaps cleaning out some of the old MPI installations would help.

On Nov 24, 2011, at 4:32 PM, Markus Stiller wrote:

> On 11/24/2011 10:08 PM, MM wrote:
>> Hi
>>
>> I get the same error while linking against home built 1.5.4 openmpi libs on
>> win32.
>> I didn't get this error against the prebuilt libs.
>>
>> I see you use Suse. There probably is a openmpi.rpm or openmpi.dpkg already
>> available for Suse which contains the libraries and you could link against
>> those and that may work
>>
>> MM
>>
>> -Original Message-
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>> Behalf Of Markus Stiller
>> Sent: 24 November 2011 20:41
>> To: us...@open-mpi.org
>> Subject: [OMPI users] open-mpi error
>>
>> Hello,
>>
>> i have some problem with mpi, i looked in the FAQ and google already but i
>> couldnt find a solution.
>>
>> To build mpi i used this:
>> shell$ ./configure --prefix=/opt/mpirun
>> <...lots of output...>
>> shell$ make all install
>>
>> Worked fine so far.
I am using dlpoly, and this makefile: >> $(MAKE) LD="mpif90 -o" LDFLAGS="-O3" \ >> FC="mpif90 -c" FCFLAGS="-O3" \ >> EX=$(EX) BINROOT=$(BINROOT) $(TYPE) >> >> This worked fine too, >> the problem occurs when i want to run a job with >> mpiexec -n 4 ./DLPOLY.Z or >> mpirun -n 4 ./DLPOLY.z >> >> I get this error: >> -- >> [linux-6wa6:02927] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file >> orterun.c at line 543 markus@linux-6wa6:/media/808CCB178CCB069E/MD >> Simulations/Test Simu1> sudo mpiexec -n 4 ./DLPOLY.Z [linux-6wa6:03731] >> [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at >> line 125 >> -- >> It looks like orte_init failed for some reason; your parallel process is >> likely to abort. There are many reasons that a parallel process can fail >> during orte_init; some of which are due to configuration or environment >> problems. This failure appears to be an internal failure; here's some >> additional information (which may only be relevant to an Open MPI >> developer): >> >>orte_ess_base_select failed >>--> Returned value Not found (-13) instead of ORTE_SUCCESS >> -- >> [linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file >> orterun.c at line 543 >> >> >> Some Informations: >> I use Open MPI 1.4.4, Suse 64bit, AMD quadcore >> >> make check gives: >> make: *** No rule to make target `check'. Stop. >> I attached the ompi_info. >> >> Thx alot for your help, >> >> regards, >> Markus >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > > > Now i made open mpi new, but now im ggetting stuff like this: > > .. 
> /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined > reference to `lam_ssi_base_param_find' > /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined > reference to `asc_parse' > /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined > reference to `lam_ssi_base_param_register_string' > /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined > reference to `lam_ssi_base_param_register_int' > /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined > reference to `lampanic' > /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined > reference to `lam_thread_self' > /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined > reference to `lam_debug_close' > /usr/local/lib64/libmpi_f77.so: undefined reference to > `MPI_CONVERSION_FN_NULL' > /usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_File_read_at_all' > /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined > reference to
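A minimal sketch of the environment setup described in the reply at the top of this message, assuming a Bourne-style shell and the /opt/mpirun prefix from the quoted configure line. The variable name `OMPI_PREFIX` is introduced here for illustration only.

```shell
# Prefix taken from the quoted configure line (--prefix=/opt/mpirun).
OMPI_PREFIX=/opt/mpirun

# Put this installation first so its mpirun, mpif90 and libmpi are found
# before any other MPI (MPICH, LAM, distribution packages) on the system.
export PATH="$OMPI_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Show which launcher and Fortran wrapper the shell resolves now.
for tool in mpirun mpif90; do
    command -v "$tool" || echo "$tool: not found in PATH" >&2
done
```

The application would then be rebuilt with the mpif90 reported here, so that the MPI used at compile time and at run time is the same. Note that running mpiexec under sudo, as in the quoted session, typically resets these variables.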
Re: [OMPI users] open-mpi error
On 11/24/2011 10:08 PM, MM wrote: Hi I get the same error while linking against home built 1.5.4 openmpi libs on win32. I didn't get this error against the prebuilt libs. I see you use Suse. There probably is a openmpi.rpm or openmpi.dpkg already available for Suse which contains the libraries and you could link against those and that may work MM -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Markus Stiller Sent: 24 November 2011 20:41 To: us...@open-mpi.org Subject: [OMPI users] open-mpi error Hello, i have some problem with mpi, i looked in the FAQ and google already but i couldnt find a solution. To build mpi i used this: shell$ ./configure --prefix=/opt/mpirun <...lots of output...> shell$ make all install Worked fine so far. I am using dlpoly, and this makefile: $(MAKE) LD="mpif90 -o" LDFLAGS="-O3" \ FC="mpif90 -c" FCFLAGS="-O3" \ EX=$(EX) BINROOT=$(BINROOT) $(TYPE) This worked fine too, the problem occurs when i want to run a job with mpiexec -n 4 ./DLPOLY.Z or mpirun -n 4 ./DLPOLY.z I get this error: -- [linux-6wa6:02927] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orterun.c at line 543 markus@linux-6wa6:/media/808CCB178CCB069E/MD Simulations/Test Simu1> sudo mpiexec -n 4 ./DLPOLY.Z [linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125 -- It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. 
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): orte_ess_base_select failed --> Returned value Not found (-13) instead of ORTE_SUCCESS -- [linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orterun.c at line 543 Some Informations: I use Open MPI 1.4.4, Suse 64bit, AMD quadcore make check gives: make: *** No rule to make target `check'. Stop. I attached the ompi_info. Thx alot for your help, regards, Markus ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users Now i made open mpi new, but now im ggetting stuff like this: .. /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `lam_ssi_base_param_find' /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `asc_parse' /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `lam_ssi_base_param_register_string' /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `lam_ssi_base_param_register_int' /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `lampanic' /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `lam_thread_self' /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `lam_debug_close' /usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_CONVERSION_FN_NULL' /usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_File_read_at_all' /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `sfh_sock_set_buf_size' /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `blktype' /usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_File_preallocate' /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to 
`ao_init' /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `lam_mutex_destroy' /usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_File_iread_shared' /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `al_init' /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `stoi' /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `lam_ssi_base_hostmap' /usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_FORTRAN_ERRCODES_IGNORE' /usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_File_close' /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `al_next' /usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_Register_datarep' /usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_FORTRAN_STATUSES_IGNORE' /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined reference to `nid_parse'
Re: [OMPI users] open-mpi error
Hi I get the same error while linking against home built 1.5.4 openmpi libs on win32. I didn't get this error against the prebuilt libs. I see you use Suse. There probably is a openmpi.rpm or openmpi.dpkg already available for Suse which contains the libraries and you could link against those and that may work MM -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Markus Stiller Sent: 24 November 2011 20:41 To: us...@open-mpi.org Subject: [OMPI users] open-mpi error Hello, i have some problem with mpi, i looked in the FAQ and google already but i couldnt find a solution. To build mpi i used this: shell$ ./configure --prefix=/opt/mpirun <...lots of output...> shell$ make all install Worked fine so far. I am using dlpoly, and this makefile: $(MAKE) LD="mpif90 -o" LDFLAGS="-O3" \ FC="mpif90 -c" FCFLAGS="-O3" \ EX=$(EX) BINROOT=$(BINROOT) $(TYPE) This worked fine too, the problem occurs when i want to run a job with mpiexec -n 4 ./DLPOLY.Z or mpirun -n 4 ./DLPOLY.z I get this error: -- [linux-6wa6:02927] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orterun.c at line 543 markus@linux-6wa6:/media/808CCB178CCB069E/MD Simulations/Test Simu1> sudo mpiexec -n 4 ./DLPOLY.Z [linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125 -- It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. 
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): orte_ess_base_select failed --> Returned value Not found (-13) instead of ORTE_SUCCESS -- [linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orterun.c at line 543 Some Informations: I use Open MPI 1.4.4, Suse 64bit, AMD quadcore make check gives: make: *** No rule to make target `check'. Stop. I attached the ompi_info. Thx alot for your help, regards, Markus
[OMPI users] open-mpi error
Hello, i have some problem with mpi, i looked in the FAQ and google already but i couldnt find a solution. To build mpi i used this: shell$ ./configure --prefix=/opt/mpirun <...lots of output...> shell$ make all install Worked fine so far. I am using dlpoly, and this makefile: $(MAKE) LD="mpif90 -o" LDFLAGS="-O3" \ FC="mpif90 -c" FCFLAGS="-O3" \ EX=$(EX) BINROOT=$(BINROOT) $(TYPE) This worked fine too, the problem occurs when i want to run a job with mpiexec -n 4 ./DLPOLY.Z or mpirun -n 4 ./DLPOLY.z I get this error: -- [linux-6wa6:02927] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orterun.c at line 543 markus@linux-6wa6:/media/808CCB178CCB069E/MD Simulations/Test Simu1> sudo mpiexec -n 4 ./DLPOLY.Z [linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125 -- It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): orte_ess_base_select failed --> Returned value Not found (-13) instead of ORTE_SUCCESS -- [linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orterun.c at line 543 Some Informations: I use Open MPI 1.4.4, Suse 64bit, AMD quadcore make check gives: make: *** No rule to make target `check'. Stop. I attached the ompi_info. 
Thx alot for your help, regards, Markus markus@linux-6wa6:/media/808CCB178CCB069E/MD Simulations/Test Simu1> ompi_info --all Package: Open MPI abuild@build08 Distribution Open MPI: 1.4.3 Open MPI SVN revision: r23834 Open MPI release date: Oct 05, 2010 Open RTE: 1.4.3 Open RTE SVN revision: r23834 Open RTE release date: Oct 05, 2010 OPAL: 1.4.3 OPAL SVN revision: r23834 OPAL release date: Oct 05, 2010 Ident string: 1.4.3 MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.4.3) MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.4.3) MCA timer: linux (MCA v2.0, API v2.0, Component v1.4.3) MCA installdirs: env (MCA v2.0, API v2.0, Component v1.4.3) MCA installdirs: config (MCA v2.0, API v2.0, Component v1.4.3) Prefix: /usr/lib64/mpi/gcc/openmpi Exec_prefix: /usr/lib64/mpi/gcc/openmpi Bindir: /usr/lib64/mpi/gcc/openmpi/bin Sbindir: /usr/lib64/mpi/gcc/openmpi/sbin Libdir: /usr/lib64/mpi/gcc/openmpi/lib64 Incdir: /usr/lib64/mpi/gcc/openmpi/include Mandir: /usr/lib64/mpi/gcc/openmpi/share/man Pkglibdir: /usr/lib64/mpi/gcc/openmpi/lib64/openmpi Libexecdir: /usr/lib64/mpi/gcc/openmpi/lib Datarootdir: /usr/lib64/mpi/gcc/openmpi/share Datadir: /usr/lib64/mpi/gcc/openmpi/share Sysconfdir: /etc Sharedstatedir: /usr/lib64/mpi/gcc/openmpi/com Localstatedir: /var Infodir: /usr/lib64/mpi/gcc/openmpi/share/info Pkgdatadir: /usr/lib64/mpi/gcc/openmpi/share/openmpi Pkglibdir: /usr/lib64/mpi/gcc/openmpi/lib64/openmpi Pkgincludedir: /usr/lib64/mpi/gcc/openmpi/include/openmpi Configured architecture: x86_64-suse-linux-gnu Configure host: build08 Configured by: abuild Configured on: Sat Oct 29 15:50:22 UTC 2011 Configure host: build08 Built by: abuild Built on: Sat Oct 29 16:04:18 UTC 2011 Built host: build08 C bindings: yes C++ bindings: yes Fortran77 bindings: yes (all) Fortran90 bindings: yes Fortran90 bindings size: small C compiler: gcc C compiler absolute: /usr/bin/gcc C char size: 1 C bool size: 1 C short size: 2 C int size: 4 C long size: 8 C float size: 4 C double 
size: 8 C pointer size: 8 C char align: 1 C bool align: 1 C int align: 4 C float align: 4 C double align: 8 C++ compiler: g++ C++ compiler absolute: /usr/bin/g++ Fortran77 compiler: gfortran Fortran77 compiler abs: /usr/bin/gfortran Fortran90 compiler: gfortran Fortran90 compiler abs: /usr/bin/gfortran Fort integer size: 4 Fort logical size: 4 Fort logical value true: 1 Fort have integer1: yes Fort have
Re: [OMPI users] How are the Open MPI processes spawned?
On Nov 24, 2011, at 11:49 AM, Paul Kapinos wrote:

> Hello Ralph, Terry, all!
>
> again, two news: the good one and the second one.
>
> Ralph Castain wrote:
>> Yes, that would indeed break things. The 1.5 series isn't correctly checking
>> connections across multiple interfaces until it finds one that works - it
>> just uses the first one it sees. :-(
>
> Yahhh!!
> This behaviour - catch a random interface and hang forever if something is
> wrong with it - is somewhat less than perfect.
>
> From my perspective - the users one - OpenMPI should try to use eitcher *all*
> available networks (as 1.4 it does...), starting with the high performance
> ones, or *only* those interfaces on which the hostnames from the hostfile are
> bound to.

It is indeed supposed to do the former - as I implied, this is a bug in the 1.5 series.

>
> Also, there should be timeouts (if you cannot connect to a node within a
> minute you probably will never ever be connected...)

We have debated about this for some time - there is a timeout mca param one can set, but we'll consider again making it default.

>
> If some connection runs into a timeout a warning would be great (and a hint
> to take off the interface by oob_tcp_if_exclude, btl_tcp_if_exclude).
>
> Should it not?
> Maybe you can file it as a "call for enhancement"...

Probably the right approach at this time.

>
>
>
>> The solution is to specify -mca oob_tcp_if_include ib0. This will direct the
>> run-time wireup across the IP over IB interface.
>> You will also need the -mca btl_tcp_if_include ib0 as well so the MPI comm
>> goes exclusively over that network.
>
> YES! This works. Adding
> -mca oob_tcp_if_include ib0 -mca btl_tcp_if_include ib0
> to the command line of mpiexec helps me to run the 1.5.x programs, so I
> believe this is the workaround.
>
> Many thanks for this hint, Ralph! My fail to not to find it in the FAQ (I was
> so close :o) http://www.open-mpi.org/faq/?category=tcp#tcp-selection
>
> But then I ran into yet another one issue. In
> http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
> the way to define MCA parameters over environment variables is described.
>
> I tried it:
> $ export OMPI_MCA_oob_tcp_if_include=ib0
> $ export OMPI_MCA_btl_tcp_if_include=ib0
>
>
> I checked it:
> $ ompi_info --param all all | grep oob_tcp_if_include
> MCA oob: parameter "oob_tcp_if_include" (current value:
> , data source: environment or cmdline)
> $ ompi_info --param all all | grep btl_tcp_if_include
> MCA btl: parameter "btl_tcp_if_include" (current value:
> , data source: environment or cmdline)
>
>
> But then I get again the hang-up issue!
>
> ==> seem, mpiexec does not understand these environment variables! and only
> get the command line options. This should not be so?

No, that isn't what is happening. The problem lies in the behavior of rsh/ssh. This environment does not forward environmental variables. Because of limits on cmd line length, we don't automatically forward MCA params from the environment, but only from the cmd line. It is an annoying limitation, but one outside our control.

Put those envars in the default mca param file and the problem will be resolved.

>
> (I also tried to advise to provide the envvars by -x
> OMPI_MCA_oob_tcp_if_include -x OMPI_MCA_btl_tcp_if_include - nothing changed.

I'm surprised by that - they should be picked up and forwarded. Could be a bug

> Well, they are OMPI_ variables and should be provided in any case).

No, they aren't - they are not treated differently than any other envar.

>
>
> Best wishes and many thanks for all,
>
> Paul Kapinos
>
>
>
>
>> Specifying both include and exclude should generate an error as those are
>> mutually exclusive options - I think this was also missed in early 1.5
>> releases and was recently patched.
>> HTH >> Ralph >> On Nov 23, 2011, at 12:14 PM, TERRY DONTJE wrote: >>> On 11/23/2011 2:02 PM, Paul Kapinos wrote: Hello Ralph, hello all, Two news, as usual a good and a bad one. The good: we believe to find out *why* it hangs The bad: it seem for me, this is a bug or at least undocumented feature of Open MPI /1.5.x. In detail: As said, we see mystery hang-ups if starting on some nodes using some permutation of hostnames. Usually removing "some bad" nodes helps, sometimes a permutation of node names in the hostfile is enough(!). The behaviour is reproducible. The machines have at least 2 networks: *eth0* is used for installation, monitoring, ... - this ethernet is very slim *ib0* - is the "IP over IB" interface and is used for everything: the file systems, ssh and so on. The hostnames are bound to the ib0 network; our idea was not to use eth0 for MPI at all. all machines are available from any over ib0 (are in one network). But on eth0 there
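The advice in this reply to put the settings in the default MCA parameter file could look roughly like the sketch below. It assumes the per-user file `$HOME/.openmpi/mca-params.conf`; the helper name `write_mca_params` is hypothetical. It also assumes the home directory is shared across nodes, since otherwise the file would presumably need to exist on every node.

```shell
# Hypothetical helper: append the interface selection to an MCA parameter file.
write_mca_params() {
    conf="$1"
    mkdir -p "$(dirname "$conf")"
    # Same settings as the -mca flags, one "name = value" pair per line.
    cat >> "$conf" <<'EOF'
oob_tcp_if_include = ib0
btl_tcp_if_include = ib0
EOF
}

# Example call for the per-user file:
#   write_mca_params "$HOME/.openmpi/mca-params.conf"
```

Because the file is read on each node, this avoids relying on rsh/ssh to forward the `OMPI_MCA_*` environment variables.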
Re: [OMPI users] How are the Open MPI processes spawned?
Hello Ralph, Terry, all! again, two news: the good one and the second one. Ralph Castain wrote: Yes, that would indeed break things. The 1.5 series isn't correctly checking connections across multiple interfaces until it finds one that works - it just uses the first one it sees. :-( Yahhh!! This behaviour - catch a random interface and hang forever if something is wrong with it - is somewhat less than perfect. From my perspective - the users one - OpenMPI should try to use eitcher *all* available networks (as 1.4 it does...), starting with the high performance ones, or *only* those interfaces on which the hostnames from the hostfile are bound to. Also, there should be timeouts (if you cannot connect to a node within a minute you probably will never ever be connected...) If some connection runs into a timeout a warning would be great (and a hint to take off the interface by oob_tcp_if_exclude, btl_tcp_if_exclude). Should it not? Maybe you can file it as a "call for enhancement"... The solution is to specify -mca oob_tcp_if_include ib0. This will direct the run-time wireup across the IP over IB interface. You will also need the -mca btl_tcp_if_include ib0 as well so the MPI comm goes exclusively over that network. YES! This works. Adding -mca oob_tcp_if_include ib0 -mca btl_tcp_if_include ib0 to the command line of mpiexec helps me to run the 1.5.x programs, so I believe this is the workaround. Many thanks for this hint, Ralph! My fail to not to find it in the FAQ (I was so close :o) http://www.open-mpi.org/faq/?category=tcp#tcp-selection But then I ran into yet another one issue. In http://www.open-mpi.org/faq/?category=tuning#setting-mca-params the way to define MCA parameters over environment variables is described. 
I tried it: $ export OMPI_MCA_oob_tcp_if_include=ib0 $ export OMPI_MCA_btl_tcp_if_include=ib0 I checked it: $ ompi_info --param all all | grep oob_tcp_if_include MCA oob: parameter "oob_tcp_if_include" (current value: , data source: environment or cmdline) $ ompi_info --param all all | grep btl_tcp_if_include MCA btl: parameter "btl_tcp_if_include" (current value: , data source: environment or cmdline) But then I get again the hang-up issue! ==> seem, mpiexec does not understand these environment variables! and only get the command line options. This should not be so? (I also tried to advise to provide the envvars by -x OMPI_MCA_oob_tcp_if_include -x OMPI_MCA_btl_tcp_if_include - nothing changed. Well, they are OMPI_ variables and should be provided in any case). Best wishes and many thanks for all, Paul Kapinos Specifying both include and exclude should generate an error as those are mutually exclusive options - I think this was also missed in early 1.5 releases and was recently patched. HTH Ralph On Nov 23, 2011, at 12:14 PM, TERRY DONTJE wrote: On 11/23/2011 2:02 PM, Paul Kapinos wrote: Hello Ralph, hello all, Two news, as usual a good and a bad one. The good: we believe to find out *why* it hangs The bad: it seem for me, this is a bug or at least undocumented feature of Open MPI /1.5.x. In detail: As said, we see mystery hang-ups if starting on some nodes using some permutation of hostnames. Usually removing "some bad" nodes helps, sometimes a permutation of node names in the hostfile is enough(!). The behaviour is reproducible. The machines have at least 2 networks: *eth0* is used for installation, monitoring, ... - this ethernet is very slim *ib0* - is the "IP over IB" interface and is used for everything: the file systems, ssh and so on. The hostnames are bound to the ib0 network; our idea was not to use eth0 for MPI at all. all machines are available from any over ib0 (are in one network). 
But on eth0 there are at least two different networks; especially the computer linuxbsc025 is in different network than the others and is not reachable from other nodes over eth0! (but reachable over ib0. The name used in the hostfile is resolved to the IP of ib0 ). So I believe that Open MPI /1.5.x tries to communicate over eth0 and cannot do it, and hangs. The /1.4.3 does not hang, so this issue is 1.5.x-specific (seen in 1.5.3 and 1.5.4). A bug? I also tried to disable the eth0 completely: $ mpiexec -mca btl_tcp_if_exclude eth0,lo -mca btl_tcp_if_include ib0 ... I believe if you give "-mca btl_tcp_if_include ib0" you do not need to specify the exclude parameter. ...but this does not help. All right, the above command should disable the usage of eth0 for MPI communication itself, but it hangs just before the MPI is started, isn't it? (because one process lacks, the MPI_INIT cannot be passed) By "just before the MPI is started" do you mean while orte is launching the processes. I wonder if you need to specify "-mca oob_tcp_if_include ib0" also but I think that may depend on which oob you are using. Now a question: is there a way to forbid the mpiexec to
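The command-line workaround confirmed earlier in this message can be wrapped so that both parameters are always passed together. This is only a sketch: the function name `run_on_ib0`, the `MPIEXEC` override, the hostfile name, and `./a.out` are placeholders that do not come from the thread.

```shell
# Restrict both the run-time wire-up (oob) and the MPI traffic (btl) to ib0.
# MPIEXEC is an assumption added so the launcher can be swapped or stubbed.
run_on_ib0() {
    "${MPIEXEC:-mpiexec}" \
        -mca oob_tcp_if_include ib0 \
        -mca btl_tcp_if_include ib0 \
        "$@"
}

# Example call (placeholder arguments):
#   run_on_ib0 -np 4 -hostfile hosts ./a.out
```

Only the include parameters are used here, in line with the remark that include and exclude are mutually exclusive.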
Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed
Hi MM, Sorry for the delayed reply, I was busy in a meeting these days. The log files seem not very helpful to solve the problem. May be your CMakeCache.txt file would help. Currently we don't provided binaries built from trunk. Have you also tried the 1.5.x binaries? Best Regards, Shiqing On 2011-11-23 10:08 PM, MM wrote: Hi Shiqing, Is the info provided useful to understand what's going on? Alternatively, is there a way to get the provided binaries for win but off trunk rather than off 1.5.4 as on the website, because I don't have this problem when I link against those libs, Thanks MM -Original Message- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of MM Sent: 21 November 2011 21:08 To: f...@hlrs.de Cc: 'Open MPI Users' Subject: Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed Hi, I have placed the source in \Program Files\openmpi-1.5.4 the build dir in \Program Files\openmpi.build and the install dir in \Program Files\openmpi I could not find config.log in any of the 3 directories nor in the directory from which I run mpirun. The build log attached is a zip of all the .log under \Program Files\openmpi.build First, I installed the provided binaries on xp32bit, and successfully ran the program in Release mode. in debug mode, there was that error of some function missing in kernel, that you fixed in svn. Second, I then downloaded the source and built the static libraries w cmake according to README.windows, and against these home built libs, the same program run neithers in debug nor in release, because of the error below. How can I generate the config.log? About Debug/Release, thinking about it at this time, I don't really need the debug libs of openmpi. but to be able to link against vs2010 Release libs of openmpi, I need them to be linked against the Release c runtime, so I might as well link against the debug version of the openmpi libs. 
Your help is very appreciated, MM -Original Message- From: Shiqing Fan [mailto:f...@hlrs.de] Sent: 21 November 2011 12:48 To: Open MPI Users Cc: MM Subject: Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed Hi, Could you please send your config and build log to me? Have you tried with a simpler program? Does this error always happen? Regards, Shiqing On 2011-11-19 4:24 PM, MM wrote: Trying to run my program linked against debug 1.5.4 on vs2010 fails: mpirun -np 1 .\nhui\Debug\nhui.exe : -np 1 .\nhcomp\Debug\nhcomp.exe [PCNAME:04960] [[1282,0],0] ORTE_ERROR_LOG: Not found in file C:\Program Files\openmpi-1.5.4\orte\mca\ess\hnp\ess_hnp_module.c at line 536 -- It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): orte_debugger_select failed --> Returned value Not found (-13) instead of ORTE_SUCCESS -- [PCNAME:04960] [[1282,0],0] ORTE_ERROR_LOG: Not found in file C:\Program Files\openmpi-1.5.4\orte\runtime\orte_init.c at line 128 -- It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. 
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): orte_ess_set_name failed --> Returned value Not found (-13) instead of ORTE_SUCCESS -- [LLDNRATDHY9H4J:04960] [[1282,0],0] ORTE_ERROR_LOG: Not found in file C:\Program Files\openmpi-1.5.4\orte\tools\orterun\orterun.c at line 616 any help is appreciated, MM ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- --- Shiqing Fan High Performance Computing Center Stuttgart (HLRS) Tel: ++49(0)711-685-87234 Nobelstrasse 19 Fax: ++49(0)711-685-65832 70569 Stuttgart http://www.hlrs.de/organization/people/shiqing-fan/ email: f...@hlrs.de
Re: [OMPI users] Accessing OpenMPI processes over Internet using ssh
Hi,

On 24.11.2011 at 05:26, Jaison Paul wrote:

> I am trying to access OpenMPI processes over Internet using ssh and not quite
> successful, yet. I believe that I should be able to do it.
>
> I have to run one process on my PC and the rest on a remote cluster over
> internet. I have set the public keys (at .ssh/authorized_keys) to access
> remote nodes without a password.
>
> I use hostfile to run mpi. It will read something like:
> -
> localhost
> u...@remotehost.com

this is not a valid syntax for Open MPI.

> -
> But it fails.
>
> The issue seems to be the user! That is, the user on my PC is different to
> that of user at remotehosts. That's my assumption.
>
> Is this the problem? Is there any work-around to solve this issue? Do I need
> to have same username at all nodes to solve this issue?

You can define nicknames for an ssh connection in a file ~/.ssh/config like:

Host foobar
    User baz
    Hostname the.remote.server.demo
    Port 1234

While this will work with any nickname for an ssh connection, in your case the nickname must match the one specified in the hostfile, as Open MPI won't use this lookup file:

Host remotehost.com
    User user

ssh should then use the entries therein to initiate the connection. For details you can have a look at `man ssh_config`.

-- Reuti
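The ~/.ssh/config approach described above can be scripted. The sketch below is an illustration under assumptions: the helper name `add_ssh_host_user` is hypothetical, and the host and user values are the placeholders already used in this message.

```shell
# Hypothetical helper: append a Host stanza so ssh picks the remote user name.
add_ssh_host_user() {
    host="$1"
    user="$2"
    conf="${3:-$HOME/.ssh/config}"
    mkdir -p "$(dirname "$conf")"
    printf 'Host %s\n    User %s\n' "$host" "$user" >> "$conf"
    # Keep the file private; ssh refuses configs that others can write to.
    chmod 600 "$conf"
}

# Example call (placeholder values):
#   add_ssh_host_user remotehost.com user
```

With such a stanza in place, the hostfile would list just `remotehost.com`, and ssh would supply the user name itself.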