Re: [OMPI users] Accessing OpenMPI processes over Internet using ssh

2011-11-24 Thread Ralph Castain

On Nov 24, 2011, at 2:00 AM, Reuti wrote:

> Hi,
> 
> Am 24.11.2011 um 05:26 schrieb Jaison Paul:
> 
>> I am trying to access OpenMPI processes over Internet using ssh and not 
>> quite successful, yet. I believe that I should be able to do it.
>> 
>> I have to run one process on my PC and the rest on a remote cluster over 
>> internet. I have set the public keys (at .ssh/authorized_keys) to access 
>> remote nodes without a password.
>> 
>> I use hostfile to run mpi. It will read something like:
>> -
>> localhost
>> u...@remotehost.com
> 
> this is not a valid syntax for Open MPI.

This isn't correct - we have long supported that syntax in a hostfile, and 
there is no issue with having a different user name at each node.

Jaison: are you sure your nodes are set up for password-less ssh? In other 
words, have you set up your .ssh files on the remote nodes so they will allow us 
to ssh in and start a process on them without providing a password? This is the 
typical problem we see.
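
A rough sketch of the usual setup, assuming OpenSSH (substitute your actual
account for user@remotehost.com):

$ ssh-keygen -t rsa                  # create a key pair on your PC, if you have none yet
$ ssh-copy-id user@remotehost.com    # append the public key to ~/.ssh/authorized_keys remotely
$ ssh user@remotehost.com true       # must complete without any password prompt

If that last command prompts for anything, mpirun will hang at launch the same
way.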


> 
> 
>> -
>> But it fails.
>> 
>> The issue seems to be the user! That is, the user on my PC is different from 
>> the user at the remote hosts. That's my assumption.
>> 
>> Is this the problem? Is there any workaround for it? Do I need to have the 
>> same username on all nodes?
> 
> You can define nicknames for an ssh connection in a file ~/.ssh/config like:
> 
> Host foobar
>User baz
>Hostname the.remote.server.demo
>Port 1234
> 
> While any nickname will work for a plain ssh connection, in your case the Host 
> entry must match the name specified in the hostfile, because Open MPI itself 
> won't consult this lookup file:
> 
> Host remotehost.com
>User user
> 
> ssh should then use the entries therein to initiate the connection. For 
> details you can have a look at `man ssh_config`.
> 
> -- Reuti




Re: [OMPI users] open-mpi error

2011-11-24 Thread Ralph Castain
Hi Markus

You have some major problems with confused MPI installations. First, you 
cannot compile an application against MPICH and expect to run it with OMPI - 
the two are not binary compatible. You need to compile against the MPI 
installation you intend to run with.
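
A quick way to check which MPI a given wrapper belongs to (a sketch - Open MPI 
wrappers accept --showme, MPICH wrappers accept -show):

$ mpif90 --showme    # prints the underlying compile line for an Open MPI wrapper
$ mpif90 -show       # prints it for an MPICH wrapper

Whichever form answers tells you which installation your build actually used.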

Second, your errors appear to be because you are not pointing your library path 
at the OMPI installation, and so the libraries are not being found. You need to 
set LD_LIBRARY_PATH to include the path to where you installed OMPI. Based on 
the configure line you give, that would mean ensuring that /opt/mpirun/lib was 
in that envar. Likewise, /opt/mpirun/bin needs to be in your PATH.
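
For example, in a bash shell:

$ export PATH=/opt/mpirun/bin:$PATH
$ export LD_LIBRARY_PATH=/opt/mpirun/lib:$LD_LIBRARY_PATH
$ which mpicc mpirun    # both should now resolve under /opt/mpirun/bin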

Once you have those correctly set, and build your app against the appropriate 
mpicc, you should be able to run.

BTW: your last message indicates that you built against an old LAM MPI, so you 
appear to have some pretty old software lying around. Perhaps cleaning out 
some of the old MPI installations would help.
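
One way to spot the leftovers (a sketch; the binary name and paths are taken 
from your mail):

$ ldd ./DLPOLY.Z | grep -i mpi    # shows which libmpi the executable resolves to
$ ls -l /usr/local/lib64/libmpi*  # LAM-era libraries here are removal candidates

If ldd points anywhere other than your intended installation, the wrong MPI is 
still being picked up at run time.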


On Nov 24, 2011, at 4:32 PM, Markus Stiller wrote:

> On 11/24/2011 10:08 PM, MM wrote:
>> Hi
>> 
>> I get the same error while linking against home built 1.5.4 openmpi libs on
>> win32.
>> I didn't get this error against the prebuilt libs.
>> 
>> I see you use Suse. There is probably an openmpi rpm or deb package already
>> available for Suse which contains the libraries; you could link against
>> those, and that may work.
>> 
>> MM
>> 
>> -Original Message-
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>> Behalf Of Markus Stiller
>> Sent: 24 November 2011 20:41
>> To: us...@open-mpi.org
>> Subject: [OMPI users] open-mpi error
>> 
>> Hello,
>> 
>> I have a problem with MPI. I looked in the FAQ and googled already, but I
>> couldn't find a solution.
>> 
>> To build MPI I used this:
>> shell$ ./configure --prefix=/opt/mpirun
>> <...lots of output...>
>> shell$ make all install
>> 
>> Worked fine so far. I am using DL_POLY, and this makefile:
>>  $(MAKE) LD="mpif90 -o" LDFLAGS="-O3" \
>>  FC="mpif90 -c" FCFLAGS="-O3" \
>>  EX=$(EX) BINROOT=$(BINROOT) $(TYPE)
>> 
>> This worked fine too;
>> the problem occurs when I want to run a job with
>> mpiexec -n 4 ./DLPOLY.Z   or
>> mpirun -n 4 ./DLPOLY.Z
>> 
>> I get this error:
>> --
>> [linux-6wa6:02927] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>> orterun.c at line 543 markus@linux-6wa6:/media/808CCB178CCB069E/MD
>> Simulations/Test Simu1>  sudo mpiexec -n 4 ./DLPOLY.Z [linux-6wa6:03731]
>> [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at
>> line 125
>> --
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can fail
>> during orte_init; some of which are due to configuration or environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>> 
>>orte_ess_base_select failed
>>-->  Returned value Not found (-13) instead of ORTE_SUCCESS
>> --
>> [linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>> orterun.c at line 543
>> 
>> 
>> Some information:
>> I use Open MPI 1.4.4, Suse 64-bit, AMD quad-core
>> 
>> make check gives:
>> make: *** No rule to make target `check'.  Stop.
>> I attached the ompi_info.
>> 
>> Thanks a lot for your help,
>> 
>> regards,
>> Markus
>> 
>> 
>> 
> 
> 
> Now I rebuilt Open MPI, but now I'm getting stuff like this:
> 
> ..
> /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined 
> reference to `lam_ssi_base_param_find'
> /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined 
> reference to `asc_parse'
> /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined 
> reference to `lam_ssi_base_param_register_string'
> /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined 
> reference to `lam_ssi_base_param_register_int'
> /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined 
> reference to `lampanic'
> /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined 
> reference to `lam_thread_self'
> /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined 
> reference to `lam_debug_close'
> /usr/local/lib64/libmpi_f77.so: undefined reference to 
> `MPI_CONVERSION_FN_NULL'
> /usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_File_read_at_all'
> /usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: undefined 
> reference to 

Re: [OMPI users] open-mpi error

2011-11-24 Thread Markus Stiller

On 11/24/2011 10:08 PM, MM wrote:

Hi

I get the same error while linking against home built 1.5.4 openmpi libs on
win32.
I didn't get this error against the prebuilt libs.

I see you use Suse. There is probably an openmpi rpm or deb package already
available for Suse which contains the libraries; you could link against
those, and that may work.

MM

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Markus Stiller
Sent: 24 November 2011 20:41
To: us...@open-mpi.org
Subject: [OMPI users] open-mpi error

Hello,

I have a problem with MPI. I looked in the FAQ and googled already, but I
couldn't find a solution.

To build MPI I used this:
shell$ ./configure --prefix=/opt/mpirun
<...lots of output...>
shell$ make all install

Worked fine so far. I am using DL_POLY, and this makefile:
  $(MAKE) LD="mpif90 -o" LDFLAGS="-O3" \
  FC="mpif90 -c" FCFLAGS="-O3" \
  EX=$(EX) BINROOT=$(BINROOT) $(TYPE)

This worked fine too;
the problem occurs when I want to run a job with
mpiexec -n 4 ./DLPOLY.Z   or
mpirun -n 4 ./DLPOLY.Z

I get this error:
--
[linux-6wa6:02927] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
orterun.c at line 543 markus@linux-6wa6:/media/808CCB178CCB069E/MD
Simulations/Test Simu1>  sudo mpiexec -n 4 ./DLPOLY.Z [linux-6wa6:03731]
[[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at
line 125
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can fail
during orte_init; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

orte_ess_base_select failed
-->  Returned value Not found (-13) instead of ORTE_SUCCESS
--
[linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
orterun.c at line 543


Some information:
I use Open MPI 1.4.4, Suse 64-bit, AMD quad-core

make check gives:
make: *** No rule to make target `check'.  Stop.
I attached the ompi_info.

Thanks a lot for your help,

regards,
Markus






Now I rebuilt Open MPI, but now I'm getting stuff like this:

..
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `lam_ssi_base_param_find'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `asc_parse'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `lam_ssi_base_param_register_string'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `lam_ssi_base_param_register_int'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `lampanic'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `lam_thread_self'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `lam_debug_close'
/usr/local/lib64/libmpi_f77.so: undefined reference to 
`MPI_CONVERSION_FN_NULL'
/usr/local/lib64/libmpi_f77.so: undefined reference to 
`MPI_File_read_at_all'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `sfh_sock_set_buf_size'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `blktype'
/usr/local/lib64/libmpi_f77.so: undefined reference to 
`MPI_File_preallocate'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `ao_init'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `lam_mutex_destroy'
/usr/local/lib64/libmpi_f77.so: undefined reference to 
`MPI_File_iread_shared'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `al_init'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `stoi'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `lam_ssi_base_hostmap'
/usr/local/lib64/libmpi_f77.so: undefined reference to 
`MPI_FORTRAN_ERRCODES_IGNORE'

/usr/local/lib64/libmpi_f77.so: undefined reference to `MPI_File_close'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `al_next'
/usr/local/lib64/libmpi_f77.so: undefined reference to 
`MPI_Register_datarep'
/usr/local/lib64/libmpi_f77.so: undefined reference to 
`MPI_FORTRAN_STATUSES_IGNORE'
/usr/lib64/gcc/x86_64-suse-linux/4.6/../../../../lib64/libmpi.so: 
undefined reference to `nid_parse'

Re: [OMPI users] open-mpi error

2011-11-24 Thread MM
Hi

I get the same error while linking against home built 1.5.4 openmpi libs on
win32.
I didn't get this error against the prebuilt libs.

I see you use Suse. There is probably an openmpi rpm or deb package already
available for Suse which contains the libraries; you could link against
those, and that may work.
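
Installing it would look something like this (package names are a guess - check
what your Suse release actually ships):

$ sudo zypper install openmpi openmpi-devel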

MM

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Markus Stiller
Sent: 24 November 2011 20:41
To: us...@open-mpi.org
Subject: [OMPI users] open-mpi error

Hello,

I have a problem with MPI. I looked in the FAQ and googled already, but I
couldn't find a solution.

To build MPI I used this:
shell$ ./configure --prefix=/opt/mpirun
<...lots of output...>
shell$ make all install

Worked fine so far. I am using DL_POLY, and this makefile:
 $(MAKE) LD="mpif90 -o" LDFLAGS="-O3" \
 FC="mpif90 -c" FCFLAGS="-O3" \
 EX=$(EX) BINROOT=$(BINROOT) $(TYPE)

This worked fine too;
the problem occurs when I want to run a job with
mpiexec -n 4 ./DLPOLY.Z   or
mpirun -n 4 ./DLPOLY.Z

I get this error:
--
[linux-6wa6:02927] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
orterun.c at line 543 markus@linux-6wa6:/media/808CCB178CCB069E/MD
Simulations/Test Simu1> sudo mpiexec -n 4 ./DLPOLY.Z [linux-6wa6:03731]
[[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at
line 125
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can fail
during orte_init; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

   orte_ess_base_select failed
   --> Returned value Not found (-13) instead of ORTE_SUCCESS
--
[linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
orterun.c at line 543


Some information:
I use Open MPI 1.4.4, Suse 64-bit, AMD quad-core

make check gives:
make: *** No rule to make target `check'.  Stop.
I attached the ompi_info.

Thanks a lot for your help,

regards,
Markus




[OMPI users] open-mpi error

2011-11-24 Thread Markus Stiller

Hello,

I have a problem with MPI. I looked in the FAQ and googled already, but I 
couldn't find a solution.


To build MPI I used this:
shell$ ./configure --prefix=/opt/mpirun
<...lots of output...>
shell$ make all install

Worked fine so far. I am using DL_POLY, and this makefile:
$(MAKE) LD="mpif90 -o" LDFLAGS="-O3" \
FC="mpif90 -c" FCFLAGS="-O3" \
EX=$(EX) BINROOT=$(BINROOT) $(TYPE)

This worked fine too;
the problem occurs when I want to run a job with
mpiexec -n 4 ./DLPOLY.Z   or
mpirun -n 4 ./DLPOLY.Z

I get this error:
--
[linux-6wa6:02927] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
orterun.c at line 543
markus@linux-6wa6:/media/808CCB178CCB069E/MD Simulations/Test Simu1> 
sudo mpiexec -n 4 ./DLPOLY.Z
[linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
runtime/orte_init.c at line 125

--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--
[linux-6wa6:03731] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
orterun.c at line 543



Some information:
I use Open MPI 1.4.4, Suse 64-bit, AMD quad-core

make check gives:
make: *** No rule to make target `check'.  Stop.
I attached the ompi_info.

Thanks a lot for your help,

regards,
Markus

markus@linux-6wa6:/media/808CCB178CCB069E/MD Simulations/Test Simu1> ompi_info --all
  Package: Open MPI abuild@build08 Distribution
  Open MPI: 1.4.3
  Open MPI SVN revision: r23834
  Open MPI release date: Oct 05, 2010
  Open RTE: 1.4.3
  Open RTE SVN revision: r23834
  Open RTE release date: Oct 05, 2010
  OPAL: 1.4.3
  OPAL SVN revision: r23834
  OPAL release date: Oct 05, 2010
  Ident string: 1.4.3
  MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.4.3)
  MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.4.3)
  MCA timer: linux (MCA v2.0, API v2.0, Component v1.4.3)
  MCA installdirs: env (MCA v2.0, API v2.0, Component v1.4.3)
  MCA installdirs: config (MCA v2.0, API v2.0, Component v1.4.3)
  Prefix: /usr/lib64/mpi/gcc/openmpi
  Exec_prefix: /usr/lib64/mpi/gcc/openmpi
  Bindir: /usr/lib64/mpi/gcc/openmpi/bin
  Sbindir: /usr/lib64/mpi/gcc/openmpi/sbin
  Libdir: /usr/lib64/mpi/gcc/openmpi/lib64
  Incdir: /usr/lib64/mpi/gcc/openmpi/include
  Mandir: /usr/lib64/mpi/gcc/openmpi/share/man
  Pkglibdir: /usr/lib64/mpi/gcc/openmpi/lib64/openmpi
  Libexecdir: /usr/lib64/mpi/gcc/openmpi/lib
  Datarootdir: /usr/lib64/mpi/gcc/openmpi/share
  Datadir: /usr/lib64/mpi/gcc/openmpi/share
  Sysconfdir: /etc
  Sharedstatedir: /usr/lib64/mpi/gcc/openmpi/com
  Localstatedir: /var
  Infodir: /usr/lib64/mpi/gcc/openmpi/share/info
  Pkgdatadir: /usr/lib64/mpi/gcc/openmpi/share/openmpi
  Pkglibdir: /usr/lib64/mpi/gcc/openmpi/lib64/openmpi
  Pkgincludedir: /usr/lib64/mpi/gcc/openmpi/include/openmpi
  Configured architecture: x86_64-suse-linux-gnu
  Configure host: build08
  Configured by: abuild
  Configured on: Sat Oct 29 15:50:22 UTC 2011
  Configure host: build08
  Built by: abuild
  Built on: Sat Oct 29 16:04:18 UTC 2011
  Built host: build08
  C bindings: yes
  C++ bindings: yes
  Fortran77 bindings: yes (all)
  Fortran90 bindings: yes
  Fortran90 bindings size: small
  C compiler: gcc
  C compiler absolute: /usr/bin/gcc
  C char size: 1
  C bool size: 1
  C short size: 2
  C int size: 4
  C long size: 8
  C float size: 4
  C double size: 8
  C pointer size: 8
  C char align: 1
  C bool align: 1
  C int align: 4
  C float align: 4
  C double align: 8
  C++ compiler: g++
  C++ compiler absolute: /usr/bin/g++
  Fortran77 compiler: gfortran
  Fortran77 compiler abs: /usr/bin/gfortran
  Fortran90 compiler: gfortran
  Fortran90 compiler abs: /usr/bin/gfortran
  Fort integer size: 4
  Fort logical size: 4
  Fort logical value true: 1
  Fort have integer1: yes
  Fort have 

Re: [OMPI users] How are the Open MPI processes spawned?

2011-11-24 Thread Ralph Castain

On Nov 24, 2011, at 11:49 AM, Paul Kapinos wrote:

> Hello Ralph, Terry, all!
> 
> again, two pieces of news: the good one and the second one.
> 
> Ralph Castain wrote:
>> Yes, that would indeed break things. The 1.5 series isn't correctly checking 
>> connections across multiple interfaces until it finds one that works - it 
>> just uses the first one it sees. :-(
> 
> Yahhh!!
> This behaviour - catching a random interface and hanging forever if something 
> is wrong with it - is somewhat less than perfect.
> 
> From my perspective - the user's one - Open MPI should try to use either *all* 
> available networks (as 1.4 does...), starting with the high-performance 
> ones, or *only* those interfaces to which the hostnames from the hostfile 
> are bound.

It is indeed supposed to do the former - as I implied, this is a bug in the 1.5 
series.

> 
> Also, there should be timeouts (if you cannot connect to a node within a 
> minute you probably will never ever be connected...)

We have debated about this for some time - there is a timeout MCA param one can 
set, but we'll consider again making it the default.

> 
> If some connection runs into a timeout, a warning would be great (and a hint 
> to exclude the interface via oob_tcp_if_exclude, btl_tcp_if_exclude).
> 
> Should it not?
> Maybe you can file it as a "call for enhancement"...

Probably the right approach at this time.

> 
> 
> 
>> The solution is to specify -mca oob_tcp_if_include ib0. This will direct the 
>> run-time wireup across the IP over IB interface.
>> You will also need the -mca btl_tcp_if_include ib0 as well so the MPI comm 
>> goes exclusively over that network. 
> 
> YES! This works. Adding
> -mca oob_tcp_if_include ib0 -mca btl_tcp_if_include ib0
> to the command line of mpiexec helps me to run the 1.5.x programs, so I 
> believe this is the workaround.
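> 
> For the record, the complete call now looks something like this (the hostfile 
> path and executable name are placeholders):
> 
> $ mpiexec -mca oob_tcp_if_include ib0 -mca btl_tcp_if_include ib0 \
>       -hostfile ./hostfile -n 4 ./myprog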
> 
> Many thanks for this hint, Ralph! My fault for not finding it in the FAQ (I was 
> so close :o) http://www.open-mpi.org/faq/?category=tcp#tcp-selection
> 
> But then I ran into yet another issue. In 
> http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
> the way to define MCA parameters via environment variables is described.
> 
> I tried it:
> $ export OMPI_MCA_oob_tcp_if_include=ib0
> $ export OMPI_MCA_btl_tcp_if_include=ib0
> 
> 
> I checked it:
> $ ompi_info --param all all | grep oob_tcp_if_include
> MCA oob: parameter "oob_tcp_if_include" (current value: 
> , data source: environment or cmdline)
> $ ompi_info --param all all | grep btl_tcp_if_include
> MCA btl: parameter "btl_tcp_if_include" (current value: 
> , data source: environment or cmdline)
> 
> 
> But then I get again the hang-up issue!
> 
> ==> It seems mpiexec does not understand these environment variables and only 
> honors the command line options. This should not be so?

No, that isn't what is happening. The problem lies in the behavior of rsh/ssh. 
This environment does not forward environment variables. Because of limits on 
cmd line length, we don't automatically forward MCA params from the 
environment, but only from the cmd line. It is an annoying limitation, but one 
outside our control.

Put those envars in the default mca param file and the problem will be resolved.
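
For example, the per-user file (a sketch; the system-wide equivalent lives in
$prefix/etc/openmpi-mca-params.conf):

$ mkdir -p $HOME/.openmpi
$ cat >> $HOME/.openmpi/mca-params.conf <<EOF
oob_tcp_if_include = ib0
btl_tcp_if_include = ib0
EOF

Every process reads that file locally at startup, so nothing has to be
forwarded over ssh.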

> 
> (I also tried to provide the envvars via -x 
> OMPI_MCA_oob_tcp_if_include -x OMPI_MCA_btl_tcp_if_include - nothing changed.

I'm surprised by that - they should be picked up and forwarded. Could be a bug.

> Well, they are OMPI_ variables and should be provided in any case).

No, they aren't - they are not treated differently than any other envar.

> 
> 
> Best wishes and many thanks for all,
> 
> Paul Kapinos
> 
> 
> 
> 
>> Specifying both include and exclude should generate an error as those are 
>> mutually exclusive options - I think this was also missed in early 1.5 
>> releases and was recently patched.
>> HTH
>> Ralph
>> On Nov 23, 2011, at 12:14 PM, TERRY DONTJE wrote:
>>> On 11/23/2011 2:02 PM, Paul Kapinos wrote:
 Hello Ralph, hello all,
 
 Two news, as usual a good and a bad one.
 
 The good: we believe we found out *why* it hangs
 
 The bad: it seems to me this is a bug, or at least an undocumented feature of 
 Open MPI 1.5.x.
 
 In detail:
 As said, we see mystery hang-ups if starting on some nodes using some 
 permutation of hostnames. Usually removing "some bad" nodes helps, 
 sometimes a permutation of node names in the hostfile is enough(!). The 
 behaviour is reproducible.
 
 The machines have at least 2 networks:
 
 *eth0* is used for installation, monitoring, ... - this ethernet is very 
 slim
 
 *ib0* - is the "IP over IB" interface and is used for everything: the file 
 systems, ssh and so on. The hostnames are bound to the ib0 network; our 
 idea was not to use eth0 for MPI at all.
 
 All machines are reachable from any other over ib0 (they are all in one network).
 
 But on eth0 there 

Re: [OMPI users] How are the Open MPI processes spawned?

2011-11-24 Thread Paul Kapinos

Hello Ralph, Terry, all!

again, two pieces of news: the good one and the second one.

Ralph Castain wrote:
Yes, that would indeed break things. The 1.5 series isn't correctly 
checking connections across multiple interfaces until it finds one that 
works - it just uses the first one it sees. :-(


Yahhh!!
This behaviour - catching a random interface and hanging forever if something 
is wrong with it - is somewhat less than perfect.


From my perspective - the user's one - Open MPI should try to use either 
*all* available networks (as 1.4 does...), starting with the high-performance 
ones, or *only* those interfaces to which the hostnames from 
the hostfile are bound.


Also, there should be timeouts (if you cannot connect to a node within a 
minute you probably will never ever be connected...)


If some connection runs into a timeout, a warning would be great (and a 
hint to exclude the interface via oob_tcp_if_exclude, btl_tcp_if_exclude).


Should it not?
Maybe you can file it as a "call for enhancement"...



The solution is to specify -mca oob_tcp_if_include ib0. This will direct 
the run-time wireup across the IP over IB interface.


You will also need the -mca btl_tcp_if_include ib0 as well so the MPI 
comm goes exclusively over that network. 


YES! This works. Adding
-mca oob_tcp_if_include ib0 -mca btl_tcp_if_include ib0
to the command line of mpiexec helps me to run the 1.5.x programs, so I 
believe this is the workaround.


Many thanks for this hint, Ralph! My fault for not finding it in the FAQ 
(I was so close :o) http://www.open-mpi.org/faq/?category=tcp#tcp-selection


But then I ran into yet another issue. In 
http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
the way to define MCA parameters via environment variables is described.

I tried it:
$ export OMPI_MCA_oob_tcp_if_include=ib0
$ export OMPI_MCA_btl_tcp_if_include=ib0


I checked it:
$ ompi_info --param all all | grep oob_tcp_if_include
 MCA oob: parameter "oob_tcp_if_include" (current 
value: , data source: environment or cmdline)

$ ompi_info --param all all | grep btl_tcp_if_include
 MCA btl: parameter "btl_tcp_if_include" (current 
value: , data source: environment or cmdline)



But then I get again the hang-up issue!

==> It seems mpiexec does not understand these environment variables and 
only honors the command line options. This should not be so?


(I also tried to provide the envvars via -x 
OMPI_MCA_oob_tcp_if_include -x OMPI_MCA_btl_tcp_if_include - nothing 
changed. Well, they are OMPI_ variables and should be provided in any case.)



Best wishes and many thanks for all,

Paul Kapinos




Specifying both include and 
exclude should generate an error as those are mutually exclusive options 
- I think this was also missed in early 1.5 releases and was recently 
patched.


HTH
Ralph


On Nov 23, 2011, at 12:14 PM, TERRY DONTJE wrote:


On 11/23/2011 2:02 PM, Paul Kapinos wrote:

Hello Ralph, hello all,

Two news, as usual a good and a bad one.

The good: we believe we found out *why* it hangs

The bad: it seems to me this is a bug, or at least an undocumented 
feature of Open MPI 1.5.x.


In detail:
As said, we see mystery hang-ups if starting on some nodes using some 
permutation of hostnames. Usually removing "some bad" nodes helps, 
sometimes a permutation of node names in the hostfile is enough(!). 
The behaviour is reproducible.


The machines have at least 2 networks:

*eth0* is used for installation, monitoring, ... - this ethernet is 
very slim


*ib0* - is the "IP over IB" interface and is used for everything: the 
file systems, ssh and so on. The hostnames are bound to the ib0 
network; our idea was not to use eth0 for MPI at all.


All machines are reachable from any other over ib0 (they are all in one network).

But on eth0 there are at least two different networks; in particular, the 
computer linuxbsc025 is in a different network than the others and is 
not reachable from other nodes over eth0! (It is reachable over ib0; 
the name used in the hostfile resolves to the IP of ib0.)


So I believe that Open MPI /1.5.x tries to communicate over eth0 and 
cannot do it, and hangs. The /1.4.3 does not hang, so this issue is 
1.5.x-specific (seen in 1.5.3 and 1.5.4). A bug?


I also tried to disable the eth0 completely:

$ mpiexec -mca btl_tcp_if_exclude eth0,lo  -mca btl_tcp_if_include 
ib0 ...


I believe if you give "-mca btl_tcp_if_include ib0" you do not need to 
specify the exclude parameter.
...but this does not help. All right, the above command should 
disable the usage of eth0 for MPI communication itself, but it hangs 
just before MPI is started, doesn't it? (Because one process is missing, 
MPI_INIT cannot complete.)


By "just before the MPI is started" do you mean while orte is 
launching the processes.
I wonder if you need to specify "-mca oob_tcp_if_include ib0" also but 
I think that may depend on which oob you are using.
Now a question: is there a way to forbid the mpiexec to 

Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed

2011-11-24 Thread Shiqing Fan

Hi MM,

Sorry for the delayed reply; I was busy in meetings these days.

The log files don't seem very helpful for solving the problem. Maybe your 
CMakeCache.txt file would help.


Currently we don't provide binaries built from trunk. Have you also 
tried the 1.5.x binaries?


Best Regards,
Shiqing

On 2011-11-23 10:08 PM, MM wrote:

Hi Shiqing,

Is the info provided useful for understanding what's going on?
Alternatively, is there a way to get the provided binaries for Windows built
off trunk rather than off 1.5.4 as on the website? I don't have this
problem when I link against those libs.

Thanks

MM

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of MM
Sent: 21 November 2011 21:08
To: f...@hlrs.de
Cc: 'Open MPI Users'
Subject: Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed

Hi,

I have placed the source in \Program Files\openmpi-1.5.4, the build dir in
\Program Files\openmpi.build, and the install dir in \Program Files\openmpi.

I could not find config.log in any of the 3 directories nor in the directory
from which I run mpirun.

The build log attached is a zip of all the .log under \Program
Files\openmpi.build

First, I installed the provided binaries on 32-bit XP and successfully ran
the program in Release mode.
In Debug mode, there was that error of some function missing in the kernel,
which you fixed in svn.

Second, I downloaded the source and built the static libraries with cmake
according to README.windows; against these home-built libs, the same
program runs in neither Debug nor Release, because of the error below.

How can I generate the config.log?

About Debug/Release: thinking about it, I don't really need the
debug libs of openmpi.
But to be able to link against the vs2010 Release libs of openmpi, I need them
to be linked against the Release C runtime, so I might as well link against
the debug version of the openmpi libs.

Your help is very appreciated,
MM

-Original Message-
From: Shiqing Fan [mailto:f...@hlrs.de]
Sent: 21 November 2011 12:48
To: Open MPI Users
Cc: MM
Subject: Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed

Hi,

Could you please send your config and build log to me? Have you tried with a
simpler program? Does this error always happen?

Regards,
Shiqing


On 2011-11-19 4:24 PM, MM wrote:

Trying to run my program linked against debug 1.5.4 on vs2010 fails:


mpirun -np 1 .\nhui\Debug\nhui.exe : -np 1
.\nhcomp\Debug\nhcomp.exe

[PCNAME:04960] [[1282,0],0] ORTE_ERROR_LOG: Not found in file
C:\Program Files\openmpi-1.5.4\orte\mca\ess\hnp\ess_hnp_module.c at
line 536
--
 It looks like orte_init failed for some reason; your parallel
process is likely to abort.  There are many reasons that a parallel
process can fail during orte_init; some of which are due to
configuration or environment problems.  This failure appears to be an
internal failure; here's some additional information (which may only
be relevant to an Open MPI developer):

orte_debugger_select failed
-->   Returned value Not found (-13) instead of ORTE_SUCCESS
--
 [PCNAME:04960] [[1282,0],0] ORTE_ERROR_LOG: Not found in file
C:\Program Files\openmpi-1.5.4\orte\runtime\orte_init.c at line 128
--
 It looks like orte_init failed for some reason; your parallel
process is likely to abort.  There are many reasons that a parallel
process can fail during orte_init; some of which are due to
configuration or environment problems.  This failure appears to be an
internal failure; here's some additional information (which may only
be relevant to an Open MPI developer):

orte_ess_set_name failed
-->   Returned value Not found (-13) instead of ORTE_SUCCESS
--
 [LLDNRATDHY9H4J:04960] [[1282,0],0] ORTE_ERROR_LOG: Not found in
file C:\Program Files\openmpi-1.5.4\orte\tools\orterun\orterun.c at
line 616

any help is appreciated,
MM




--
---
Shiqing Fan
High Performance Computing Center Stuttgart (HLRS)
Tel: ++49(0)711-685-87234  Nobelstrasse 19
Fax: ++49(0)711-685-65832  70569 Stuttgart
http://www.hlrs.de/organization/people/shiqing-fan/
email: f...@hlrs.de




Re: [OMPI users] Accessing OpenMPI processes over Internet using ssh

2011-11-24 Thread Reuti
Hi,

Am 24.11.2011 um 05:26 schrieb Jaison Paul:

> I am trying to access OpenMPI processes over Internet using ssh and not quite 
> successful, yet. I believe that I should be able to do it.
> 
> I have to run one process on my PC and the rest on a remote cluster over 
> internet. I have set the public keys (at .ssh/authorized_keys) to access 
> remote nodes without a password.
> 
> I use hostfile to run mpi. It will read something like:
> -
> localhost
> u...@remotehost.com

this is not a valid syntax for Open MPI.


> -
> But it fails.
> 
> The issue seems to be the user! That is, the user on my PC is different from 
> the user at the remote hosts. That's my assumption.
> 
> Is this the problem? Is there any workaround for it? Do I need to have the 
> same username on all nodes?

You can define nicknames for an ssh connection in a file ~/.ssh/config like:

Host foobar
User baz
Hostname the.remote.server.demo
Port 1234

While any nickname will work for a plain ssh connection, in your case the Host 
entry must match the name specified in the hostfile, because Open MPI itself 
won't consult this lookup file:

Host remotehost.com
User user

ssh should then use the entries therein to initiate the connection. For details 
you can have a look at `man ssh_config`.
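
To verify that ssh actually picks up the entry, run it verbosely once by hand:

$ ssh -v remotehost.com true

The debug output shows which user and hostname were used; the mpirun launcher
should then do the same.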

-- Reuti