Re: [OMPI users] trying to use personal copy of 1.7.4

2014-03-12 Thread Ross Boylan
On Wed, 2014-03-12 at 14:34 +, Dave Goodell (dgoodell) wrote:
> Perhaps there's an RPATH issue here?  I don't fully understand the structure 
> of Rmpi, but is there both an app and a library (or two separate libraries) 
> that are linking against MPI?
> 
> I.e., what we want is:
> 
> app -> ~ross/OMPI
> \  /
>  --> library --
> 
> But what we're getting is:
> 
> app ---> /usr/OMPI   
> \
>  --> library ---> ~ross/OMPI
> 
> 
> If one of them was first linked against the /usr/OMPI and managed to get an 
> RPATH then it could override your LD_LIBRARY_PATH.
> 
I think the relevant app here is R.  It was built without any awareness
of MPI, I'm pretty sure.  R loads the library Rmpi.so, which in turn
references MPI.  The R binary has no runpath or rpath according to
chrpath.

ldd Rmpi.so shows my local MPI libraries and not the system ones, though
it references plenty of other system libraries.

The system MPI libraries are in a standard place, /usr/lib
(/usr/lib/openmpi/lib/, more precisely), so I don't think an rpath is
necessary to find them.
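For what it's worth, lsof's `mem` entries can be cross-checked against the process's memory map: `/proc/<pid>/maps` lists every file actually mapped into the address space. A hedged sketch (`/proc/self/maps`, i.e. the current shell, stands in for the R process; against pid 17634 one would grep for libmpi instead):

```shell
# Real case: grep libmpi /proc/17634/maps
# Any mapped .so appears the same way; libc in our own map is the stand-in.
grep -m 3 'libc' /proc/self/maps
```

If both the /usr/lib and ~/install copies appear there, both really are mapped into the process, whatever ldd predicted.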

Ross
> -Dave
> 
> On Mar 12, 2014, at 5:39 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> > Generally, all you need to ensure that your personal copy of OMPI is used 
> > is to set the PATH and LD_LIBRARY_PATH to point to your new Open MPI 
> > installation.  I do this all the time on my development cluster (where I 
> > have something like 6 billion different installations of OMPI available... 
> > mmm... should probably clean that up...)
> > 
> > export LD_LIBRARY_PATH=path_to_my_ompi/lib:$LD_LIBRARY_PATH
> > export PATH=path-to-my-ompi/bin:$PATH
> > 
> > It should be noted that:
> > 
> > 1. you need to *prefix* your PATH and LD_LIBRARY_PATH with these values
> > 2. you need to set these values in a way that will be picked up on all 
> > servers that you use in your job.  The safest way to do this is in your 
> > shell startup files (e.g., $HOME/.bashrc or whatever is relevant for your 
> > shell).
> > 
> > See http://www.open-mpi.org/faq/?category=running#run-prereqs, 
> > http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path, and 
> > http://www.open-mpi.org/faq/?category=running#mpirun-prefix.
> > 
> > Note the --prefix option that is described in the 3rd FAQ item I cited -- 
> > that can be a bit easier, too.
> > 
> > 
> > 
> > On Mar 12, 2014, at 2:51 AM, Ross Boylan  wrote:
> > 
> >> I took the advice here and built a personal copy of the current openmpi,
> >> to see if the problems I was having with Rmpi were a result of the old
> >> version on the system.
> >> 
> >> When I do ldd on the relevant libraries (Rmpi.so is loaded dynamically
> >> by R) everything looks fine; path references that should be local are.
> >> But when I run the program and do lsof it shows that both the system and
> >> personal versions of key libraries are opened.
> >> 
> >> First, does anyone know which library will actually be used, or how to
> >> tell which one is actually used, in this situation? I'm running on
> >> Linux (Debian squeeze).
> >> 
> >> Second, is there some way to prevent the wrong/old/system libraries from
> >> being loaded?
> >> 
> >> FWIW I'm still seeing the old misbehavior when I run this way, but, as I
> >> said, I'm really not sure which libraries are being used.  Since Rmpi
> >> was built against the new/local ones, I think the fact that it doesn't
> >> crash means I really am using the new ones.
> >> 
> >> Here are highlights of lsof on the process running R:
> >> COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
> >> R   17634 ross  cwdDIR  254,212288 150773764 
> >> /home/ross/KHC/sunbelt
> >> R   17634 ross  rtdDIR8,1 4096 2 /
> >> R   17634 ross  txtREG8,1 5648   3058294 
> >> /usr/lib/R/bin/exec/R
> >> R   17634 ross  DELREG8,12416718 
> >> /tmp/openmpi-sessions-ross@n100_0/60429/1/shared_mem_pool.n100
> >> R   17634 ross  memREG8,1   335240   3105336 
> >> /usr/lib/openmpi/lib/libopen-pal.so.0.0.0
> >> R   17634 ross  memREG8,1   304576   3105337 
> >> /usr/lib/openmpi/lib/libopen-rte.so.0.0.0
> >> R   17634 ross  memREG8,1   679992   3105332 
> >> /usr/lib/openmpi/lib/libmpi.so.0.0.2
> >> R   17634 ross  memREG8,193936   2967826 
> >> /usr/lib/libz.so.1.2.3.4
> >> R   17634 ross  memREG8,110648   3187256 
> >> /lib/libutil-2.11.3.so
> >> R   17634 ross  memREG8,132320   2359631 
> >> /usr/lib/libpciaccess.so.0.10.8
> >> R   17634 ross  memREG8,133368   2359338 
> >> /usr/lib/libnuma.so.1
> >> R   17634 ross  memREG  254,2   979113 152045740 
> >> /home/ross/install/lib/libopen-pal.so.6.1.0
> >> R

Re: [OMPI users] trying to use personal copy of 1.7.4

2014-03-12 Thread Ralph Castain
I remember having a conversation with someone from R at Supercomputing last 
year, and this was one of the issues we discussed. The problem is that you have 
to ensure that R is built against the OMPI you are going to use, and it is 
usually better to have configured OMPI with --disable-dlopen --enable-static to 
avoid library confusion when you later run R.

I'd give that a try and see if it solves your problems. The "recipe" given by 
Bennet looked right to me.


On Mar 12, 2014, at 12:32 PM, Ross Boylan  wrote:

> On Wed, 2014-03-12 at 11:50 +0100, Reuti wrote:
>> Am 12.03.2014 um 11:39 schrieb Jeff Squyres (jsquyres):
>> 
>>> Generally, all you need to ensure that your personal copy of OMPI is used 
>>> is to set the PATH and LD_LIBRARY_PATH to point to your new Open MPI 
>>> installation.  I do this all the time on my development cluster (where I 
>>> have something like 6 billion different installations of OMPI available... 
>>> mmm... should probably clean that up...)
>>> 
>>> export LD_LIBRARY_PATH=path_to_my_ompi/lib:$LD_LIBRARY_PATH
>>> export PATH=path-to-my-ompi/bin:$PATH
> 
> I believe I've already done that.  The script that launches everything is
> (originally all on one line):
> R_PROFILE_USER=~/KHC/sunbelt/Rmpiprofile \
> LD_LIBRARY_PATH=/home/ross/install/lib:$LD_LIBRARY_PATH \
> PATH=/home/ross/install/bin:$PATH orterun -x R_PROFILE_USER -x
> LD_LIBRARY_PATH -x PATH -hostfile ~/KHC/sunbelt/hosts \
> -np 7 R --no-save  -q
> 
> There is a complication with R; it sticks stuff in front of
> LD_LIBRARY_PATH.  However, the startup script Rmpiprofile fixes that,
> though I'm not entirely sure the fix is effective.  In any case, the old
> libraries that are being loaded are not from any directories R added to
> LD_LIBRARY_PATH; instead they are from /usr/lib, which is a standard
> place for the dynamic loader to look.
> 
> 
>>> It should be noted that:
>>> 
>>> 1. you need to *prefix* your PATH and LD_LIBRARY_PATH with these values
>>> 2. you need to set these values in a way that will be picked up on all 
>>> servers that you use in your job.  The safest way to do this is in your 
>>> shell startup files (e.g., $HOME/.bashrc or whatever is relevant for your 
>>> shell).
>> 
>> I see "libtorque" in the output below - were these jobs running inside a 
>> queuing system? The set paths might be different therein, and need to be set 
>> in the job script in this case.
>> 
> No batch system (see script above for launch mechanism). We threw a lot
> of stuff MPI configure was looking for onto the system.  AFAIK torque
> isn't even installed.
> 
> One possible issue is that the Rmpi module for R is not compiled by
> mpicc; R has its own notions of proper options for the compiler and its
> own infrastructure for building things.  I did  pass the location of my
> local libraries into the build process.
> 
> This seems more like an issue with the dynamic loader, or with whatever
> system R is using when it loads Rmpi.so.
> 
> Ross
>> -- Reuti
>> 
>> 
>>> See http://www.open-mpi.org/faq/?category=running#run-prereqs, 
>>> http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path, and 
>>> http://www.open-mpi.org/faq/?category=running#mpirun-prefix.
>>> 
>>> Note the --prefix option that is described in the 3rd FAQ item I cited -- 
>>> that can be a bit easier, too.
>>> 
>>> 
>>> 
>>> On Mar 12, 2014, at 2:51 AM, Ross Boylan  wrote:
>>> 
 I took the advice here and built a personal copy of the current openmpi,
 to see if the problems I was having with Rmpi were a result of the old
 version on the system.
 
 When I do ldd on the relevant libraries (Rmpi.so is loaded dynamically
 by R) everything looks fine; path references that should be local are.
 But when I run the program and do lsof it shows that both the system and
 personal versions of key libraries are opened.
 
 First, does anyone know which library will actually be used, or how to
 tell which one is actually used, in this situation? I'm running on
 Linux (Debian squeeze).
 
 Second, is there some way to prevent the wrong/old/system libraries from
 being loaded?
 
 FWIW I'm still seeing the old misbehavior when I run this way, but, as I
 said, I'm really not sure which libraries are being used.  Since Rmpi
 was built against the new/local ones, I think the fact that it doesn't
 crash means I really am using the new ones.
 
 Here are highlights of lsof on the process running R:
 COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
 R   17634 ross  cwdDIR  254,212288 150773764 
 /home/ross/KHC/sunbelt
 R   17634 ross  rtdDIR8,1 4096 2 /
 R   17634 ross  txtREG8,1 5648   3058294 
 /usr/lib/R/bin/exec/R
 R   17634 ross  DELREG8,12416718 
 

Re: [OMPI users] trying to use personal copy of 1.7.4

2014-03-12 Thread Ross Boylan
On Wed, 2014-03-12 at 11:50 +0100, Reuti wrote:
> Am 12.03.2014 um 11:39 schrieb Jeff Squyres (jsquyres):
> 
> > Generally, all you need to ensure that your personal copy of OMPI is used 
> > is to set the PATH and LD_LIBRARY_PATH to point to your new Open MPI 
> > installation.  I do this all the time on my development cluster (where I 
> > have something like 6 billion different installations of OMPI available... 
> > mmm... should probably clean that up...)
> > 
> > export LD_LIBRARY_PATH=path_to_my_ompi/lib:$LD_LIBRARY_PATH
> > export PATH=path-to-my-ompi/bin:$PATH

I believe I've already done that.  The script that launches everything is
(originally all on one line):
R_PROFILE_USER=~/KHC/sunbelt/Rmpiprofile \
LD_LIBRARY_PATH=/home/ross/install/lib:$LD_LIBRARY_PATH \
PATH=/home/ross/install/bin:$PATH orterun -x R_PROFILE_USER -x
LD_LIBRARY_PATH -x PATH -hostfile ~/KHC/sunbelt/hosts \
-np 7 R --no-save  -q

There is a complication with R; it sticks stuff in front of
LD_LIBRARY_PATH.  However, the startup script Rmpiprofile fixes that,
though I'm not entirely sure the fix is effective.  In any case, the old
libraries that are being loaded are not from any directories R added to
LD_LIBRARY_PATH; instead they are from /usr/lib, which is a standard
place for the dynamic loader to look.
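For reference, the loader walks the colon-separated LD_LIBRARY_PATH entries left to right before falling back to the cache and default directories, so the prepend in the script above should win among those entries. A minimal sketch of that ordering (using the same path as the script):

```shell
LD_LIBRARY_PATH=/home/ross/install/lib:$LD_LIBRARY_PATH
# First entry is searched first; note that /lib and /usr/lib (via
# ld.so.cache) are still consulted *after* every LD_LIBRARY_PATH entry.
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | head -n 1   # → /home/ross/install/lib
```

That last point matters here: a dependency that your own libraries do not pull in can still be satisfied from /usr/lib no matter what LD_LIBRARY_PATH says.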


> > It should be noted that:
> > 
> > 1. you need to *prefix* your PATH and LD_LIBRARY_PATH with these values
> > 2. you need to set these values in a way that will be picked up on all 
> > servers that you use in your job.  The safest way to do this is in your 
> > shell startup files (e.g., $HOME/.bashrc or whatever is relevant for your 
> > shell).
> 
> I see "libtorque" in the output below - were these jobs running inside a 
> queuing system? The set paths might be different therein, and need to be set 
> in the job script in this case.
> 
No batch system (see script above for launch mechanism). We threw a lot
of stuff MPI configure was looking for onto the system.  AFAIK torque
isn't even installed.
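For what it's worth, the lsof output quoted earlier does show /usr/lib/libtorque.so.2.0.0 mapped into the process, so the library itself is on the system (the distro's Open MPI was presumably built with torque support) even if no batch daemons run. A quick, hedged way to check:

```shell
# Presence of the library does not imply a running batch system,
# only that something (e.g. the system Open MPI) links against it.
ls /usr/lib/libtorque* 2>/dev/null || echo "libtorque not found"
```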

One possible issue is that the Rmpi module for R is not compiled by
mpicc; R has its own notions of proper options for the compiler and its
own infrastructure for building things.  I did  pass the location of my
local libraries into the build process.

This seems more like an issue with the dynamic loader, or with whatever
system R is using when it loads Rmpi.so.
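If it is the dynamic loader, it can be asked to explain itself: with glibc, `LD_DEBUG=libs` makes ld.so print each search path it tries and the file it finally binds. A sketch (`/bin/true` is a stand-in; launching `R`, or the whole `orterun` command, this way would show which libmpi each process actually binds):

```shell
# Each "find library=..." line is followed by the directories tried
# and the path that ultimately satisfied the dependency.
LD_DEBUG=libs /bin/true 2>&1 | grep 'find library' | head -n 5
```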

Ross
> -- Reuti
> 
> 
> > See http://www.open-mpi.org/faq/?category=running#run-prereqs, 
> > http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path, and 
> > http://www.open-mpi.org/faq/?category=running#mpirun-prefix.
> > 
> > Note the --prefix option that is described in the 3rd FAQ item I cited -- 
> > that can be a bit easier, too.
> > 
> > 
> > 
> > On Mar 12, 2014, at 2:51 AM, Ross Boylan  wrote:
> > 
> >> I took the advice here and built a personal copy of the current openmpi,
> >> to see if the problems I was having with Rmpi were a result of the old
> >> version on the system.
> >> 
> >> When I do ldd on the relevant libraries (Rmpi.so is loaded dynamically
> >> by R) everything looks fine; path references that should be local are.
> >> But when I run the program and do lsof it shows that both the system and
> >> personal versions of key libraries are opened.
> >> 
> >> First, does anyone know which library will actually be used, or how to
> >> tell which one is actually used, in this situation? I'm running on
> >> Linux (Debian squeeze).
> >> 
> >> Second, is there some way to prevent the wrong/old/system libraries from
> >> being loaded?
> >> 
> >> FWIW I'm still seeing the old misbehavior when I run this way, but, as I
> >> said, I'm really not sure which libraries are being used.  Since Rmpi
> >> was built against the new/local ones, I think the fact that it doesn't
> >> crash means I really am using the new ones.
> >> 
> >> Here are highlights of lsof on the process running R:
> >> COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
> >> R   17634 ross  cwdDIR  254,212288 150773764 
> >> /home/ross/KHC/sunbelt
> >> R   17634 ross  rtdDIR8,1 4096 2 /
> >> R   17634 ross  txtREG8,1 5648   3058294 
> >> /usr/lib/R/bin/exec/R
> >> R   17634 ross  DELREG8,12416718 
> >> /tmp/openmpi-sessions-ross@n100_0/60429/1/shared_mem_pool.n100
> >> R   17634 ross  memREG8,1   335240   3105336 
> >> /usr/lib/openmpi/lib/libopen-pal.so.0.0.0
> >> R   17634 ross  memREG8,1   304576   3105337 
> >> /usr/lib/openmpi/lib/libopen-rte.so.0.0.0
> >> R   17634 ross  memREG8,1   679992   3105332 
> >> /usr/lib/openmpi/lib/libmpi.so.0.0.2
> >> R   17634 ross  memREG8,193936   2967826 
> >> /usr/lib/libz.so.1.2.3.4
> >> R   17634 ross  memREG8,110648   3187256 
> >> 

Re: [OMPI users] trying to use personal copy of 1.7.4

2014-03-12 Thread Bennet Fauber
My experience with Rmpi and Open MPI is that they don't seem to do well
with dlopen/dynamic loading.  I recently installed R 3.0.3, and
Rmpi, which failed when built against our standard OpenMPI but
succeeded using the following 'secret recipe'.  Perhaps there is
something here that will be helpful for you.

###  Install openmpi 1.6.5

export PREFIX=/scratch/support_flux/bennet/local
COMPILERS='CC=gcc CXX=g++ FC=gfortran F77=gfortran'
CONFIGURE_FLAGS='--disable-dlopen --enable-static'
cd openmpi-1.6.5
./configure --prefix=${PREFIX} \
   --mandir=${PREFIX}/man \
   --with-tm=/usr/local/torque \
   --with-openib --with-psm \
   --with-io-romio-flags='--with-file-system=testfs+ufs+nfs+lustre' \
   $CONFIGURE_FLAGS \
   $COMPILERS
make
make check
make install

### Install R 3.0.3

wget http://cran.case.edu/src/base/R-3/R-3.0.3.tar.gz
tar xzvf R-3.0.3.tar.gz
cd R-3.0.3

export MPI_HOME=/scratch/support_flux/bennet/local
export LD_LIBRARY_PATH=$MPI_HOME/lib:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH=$MPI_HOME/openmpi:${LD_LIBRARY_PATH}
export PATH=${PATH}:${MPI_HOME}/bin
export LDFLAGS='-Wl,-O1'
export R_PAPERSIZE=letter
export R_INST=${PREFIX}
export FFLAGS='-O3 -mtune=native'
export CFLAGS='-O3 -mtune=native'
./configure --prefix=${R_INST} --mandir=${R_INST}/man \
   --enable-R-shlib --without-x
make
make check
make install
wget http://www.stats.uwo.ca/faculty/yu/Rmpi/download/linux/Rmpi_0.6-3.tar.gz
R CMD INSTALL Rmpi_0.6-3.tar.gz \
   --configure-args="--with-Rmpi-include=$MPI_HOME/include \
   --with-Rmpi-libpath=$MPI_HOME/lib --with-Rmpi-type=OPENMPI"

### Make sure environment variables and paths are set

MPI_HOME=/home/software/rhel6/openmpi-1.6.5/gcc-4.4.7-static
PATH=/home/software/rhel6/openmpi-1.6.5/gcc-4.4.7-static/bin:${PATH}
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/home/software/rhel6/openmpi-1.6.5/gcc-4.4.7-static/lib
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/home/software/rhel6/openmpi-1.6.5/gcc-4.4.7-static/lib/openmpi
PATH=/home/software/rhel6/R/3.0.3/bin:${PATH}
LD_LIBRARY_PATH=/home/software/rhel6/R/3.0.3/lib64/R/lib:${LD_LIBRARY_PATH}
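Before moving on, it is worth confirming that the freshly built Rmpi.so resolved against this Open MPI rather than a system copy. A hedged sketch (`RMPI_SO` is a stand-in variable; point it at the real `.../Rmpi/libs/Rmpi.so` under your R library):

```shell
RMPI_SO=${RMPI_SO:-/bin/ls}   # stand-in so the snippet runs anywhere
if ldd "$RMPI_SO" | grep -q '/usr/lib/openmpi'; then
    echo "WARNING: still resolving against the system Open MPI"
else
    echo "no system Open MPI linkage"
fi
```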

### Then install snow with
R
> install.packages('snow')
[ . . . .


I think the key thing is the --disable-dlopen, though it might require
both.  Jeff Squyres had a post about this quite a while ago that gives
more detail about what's happening:

http://www.open-mpi.org/community/lists/devel/2012/04/10840.php

-- bennet


Re: [OMPI users] trying to use personal copy of 1.7.4

2014-03-12 Thread Dave Goodell (dgoodell)
Perhaps there's an RPATH issue here?  I don't fully understand the structure of 
Rmpi, but is there both an app and a library (or two separate libraries) that 
are linking against MPI?

I.e., what we want is:

app -> ~ross/OMPI
\  /
 --> library --

But what we're getting is:

app ---> /usr/OMPI   
\
 --> library ---> ~ross/OMPI


If one of them was first linked against the /usr/OMPI and managed to get an 
RPATH then it could override your LD_LIBRARY_PATH.
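For reference, glibc's search order is roughly DT_RPATH (honored only when no DT_RUNPATH is present), then LD_LIBRARY_PATH, then DT_RUNPATH, then the ld.so cache and default directories; so an embedded DT_RPATH silently beats LD_LIBRARY_PATH. Checking for one is quick (a sketch; in practice run it against the R binary and Rmpi.so instead of the stand-in):

```shell
# Empty grep output means no DT_RPATH/DT_RUNPATH entry is embedded.
readelf -d /bin/ls 2>/dev/null | grep -Ei 'rpath|runpath' || echo "none embedded"
```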

-Dave

On Mar 12, 2014, at 5:39 AM, Jeff Squyres (jsquyres)  wrote:

> Generally, all you need to ensure that your personal copy of OMPI is used is 
> to set the PATH and LD_LIBRARY_PATH to point to your new Open MPI 
> installation.  I do this all the time on my development cluster (where I have 
> something like 6 billion different installations of OMPI available... mmm... 
> should probably clean that up...)
> 
> export LD_LIBRARY_PATH=path_to_my_ompi/lib:$LD_LIBRARY_PATH
> export PATH=path-to-my-ompi/bin:$PATH
> 
> It should be noted that:
> 
> 1. you need to *prefix* your PATH and LD_LIBRARY_PATH with these values
> 2. you need to set these values in a way that will be picked up on all 
> servers that you use in your job.  The safest way to do this is in your shell 
> startup files (e.g., $HOME/.bashrc or whatever is relevant for your shell).
> 
> See http://www.open-mpi.org/faq/?category=running#run-prereqs, 
> http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path, and 
> http://www.open-mpi.org/faq/?category=running#mpirun-prefix.
> 
> Note the --prefix option that is described in the 3rd FAQ item I cited -- 
> that can be a bit easier, too.
> 
> 
> 
> On Mar 12, 2014, at 2:51 AM, Ross Boylan  wrote:
> 
>> I took the advice here and built a personal copy of the current openmpi,
>> to see if the problems I was having with Rmpi were a result of the old
>> version on the system.
>> 
>> When I do ldd on the relevant libraries (Rmpi.so is loaded dynamically
>> by R) everything looks fine; path references that should be local are.
>> But when I run the program and do lsof it shows that both the system and
>> personal versions of key libraries are opened.
>> 
>> First, does anyone know which library will actually be used, or how to
>> tell which one is actually used, in this situation? I'm running on
>> Linux (Debian squeeze).
>> 
>> Second, is there some way to prevent the wrong/old/system libraries from
>> being loaded?
>> 
>> FWIW I'm still seeing the old misbehavior when I run this way, but, as I
>> said, I'm really not sure which libraries are being used.  Since Rmpi
>> was built against the new/local ones, I think the fact that it doesn't
>> crash means I really am using the new ones.
>> 
>> Here are highlights of lsof on the process running R:
>> COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
>> R   17634 ross  cwdDIR  254,212288 150773764 
>> /home/ross/KHC/sunbelt
>> R   17634 ross  rtdDIR8,1 4096 2 /
>> R   17634 ross  txtREG8,1 5648   3058294 
>> /usr/lib/R/bin/exec/R
>> R   17634 ross  DELREG8,12416718 
>> /tmp/openmpi-sessions-ross@n100_0/60429/1/shared_mem_pool.n100
>> R   17634 ross  memREG8,1   335240   3105336 
>> /usr/lib/openmpi/lib/libopen-pal.so.0.0.0
>> R   17634 ross  memREG8,1   304576   3105337 
>> /usr/lib/openmpi/lib/libopen-rte.so.0.0.0
>> R   17634 ross  memREG8,1   679992   3105332 
>> /usr/lib/openmpi/lib/libmpi.so.0.0.2
>> R   17634 ross  memREG8,193936   2967826 
>> /usr/lib/libz.so.1.2.3.4
>> R   17634 ross  memREG8,110648   3187256 
>> /lib/libutil-2.11.3.so
>> R   17634 ross  memREG8,132320   2359631 
>> /usr/lib/libpciaccess.so.0.10.8
>> R   17634 ross  memREG8,133368   2359338 
>> /usr/lib/libnuma.so.1
>> R   17634 ross  memREG  254,2   979113 152045740 
>> /home/ross/install/lib/libopen-pal.so.6.1.0
>> R   17634 ross  memREG8,1   183456   2359592 
>> /usr/lib/libtorque.so.2.0.0
>> R   17634 ross  memREG  254,2  1058125 152045781 
>> /home/ross/install/lib/libopen-rte.so.7.0.0
>> R   17634 ross  memREG8,149936   2359341 
>> /usr/lib/libibverbs.so.1.0.0
>> R   17634 ross  memREG  254,2  2802579 152045867 
>> /home/ross/install/lib/libmpi.so.1.3.0
>> R   17634 ross  memREG  254,2   106626 152046481 
>> /home/ross/Rlib-3.0.1/Rmpi/libs/Rmpi.so
>> 
>> So libmpi, libopen-pal, and libopen-rte all are opened in two versions and 
>> two locations.
>> 
>> Thanks.
>> Ross Boylan
>> 
>> ___
>> users mailing list
>> 

Re: [OMPI users] Cannot run a job with more than 3 nodes

2014-03-12 Thread Jeff Squyres (jsquyres)
Can you verify that for all 4 nodes?  I.e., something like this:

foreach node (Node1 Node2 Node3 Node4)
  foreach other (Node1 Node2 Node3 Node4)
    echo from $node to $other
    ssh $node ssh $other hostname
  end
end



On Mar 12, 2014, at 7:34 AM, Victor  wrote:

> Yes they are. Can resolve and log into each node, from each node, using their 
> "friendly" name, not IP.
> 
> 
> On 12 March 2014 18:15, Jeff Squyres (jsquyres)  wrote:
> Are all names resolvable from all servers?
> 
> I.e., if you "ssh Node4" from Node1, Node2, and Node3, does it work?
> 
> 
> On Mar 12, 2014, at 4:07 AM, Victor  wrote:
> 
> > Hostname: no, I use lower case, but for some reason while I was writing 
> > the email I thought that upper case was clearer...
> >
> > The same version of Ubuntu (12.04 x64) is on all nodes and openmpi and the 
> > executable are shared via nfs.
> >
> >
> > On 12 March 2014 16:01, Reuti  wrote:
> > Hi,
> >
> > Am 12.03.2014 um 07:37 schrieb Victor:
> >
> > > I am using openmpi 1.7.4 on Ubuntu 12.04 x64 and I have a very odd 
> > > problem.
> > >
> > > I have 4 nodes, all of which are defined in the hostfile and in 
> > > /etc/hosts.
> > >
> > I can log into each node using ssh and the certificate method from the shell 
> > that is running the mpi job, by using their name as defined in /etc/hosts.
> > >
> > > I can run an mpi job if I include only 3 nodes in the hostfile, for 
> > > example:
> > >
> > > Node1 slots=8 max-slots=8
> > > Node2 slots=8 max-slots=8
> > > Node3 slots=8 max-slots=8
> >
> > You are using an uppercase name here intentionally - is this the name the 
> > host returns from `hostname`? Although uppercase is allowed and should be 
> > lowercased (or ignored) for hostname resolution, I have found that not all 
> > programs do this. In my experience it is best to use only lowercase 
> > characters.
> >
> > The same version of your Ubuntu Linux is installed on all machines?
> >
> > -- Reuti
> >
> >
> > > But if I add a fourth node into the hostfile eg:
> > >
> > > Node1 slots=8 max-slots=8
> > > Node2 slots=8 max-slots=8
> > > Node3 slots=8 max-slots=8
> > > Node4 slots=8 max-slots=8
> > >
> > > I get this error after attempting mpirun -np 32 --hostfile hostfile a.out:
> > >
> > > ssh: Could not resolve hostname Node4: Name or service not known.
> > >
> > > But, I can log into Node4 using ssh from the same shell by using ssh 
> > > Node4.
> > >
> > > Also if I mix up the hostfile like this for example and place Node1 to 
> > > the last spot:
> > >
> > > Node4 slots=8 max-slots=8
> > > Node2 slots=8 max-slots=8
> > > Node3 slots=8 max-slots=8
> > > Node1 slots=8 max-slots=8
> > >
> > > The error becomes
> > >
> > > ssh: Could not resolve hostname Node1: Name or service not known.
> > >
> > > If I then go back to the three node hostfile like this:
> > >
> > > Node1 slots=8 max-slots=8
> > > Node4 slots=8 max-slots=8
> > > Node2 slots=8 max-slots=8
> > >
> > > There is no error with three nodes even though both Node1 and Node4 
> > > "cannot be found" if they are present in a 4 node hostfile in the last 
> > > spot. The last slot seems to be bugged.
> > >
> > > What is going on? How do I fix this?
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 





Re: [OMPI users] Cannot run a job with more than 3 nodes

2014-03-12 Thread Victor
Yes they are. Can resolve and log into each node, from each node, using
their "friendly" name, not IP.


On 12 March 2014 18:15, Jeff Squyres (jsquyres)  wrote:

> Are all names resolvable from all servers?
>
> I.e., if you "ssh Node4" from Node1, Node2, and Node3, does it work?
>
>
> On Mar 12, 2014, at 4:07 AM, Victor  wrote:
>
> > Hostname: no, I use lower case, but for some reason while I was
> > writing the email I thought that upper case was clearer...
> >
> > The same version of Ubuntu (12.04 x64) is on all nodes and openmpi and
> the executable are shared via nfs.
> >
> >
> > On 12 March 2014 16:01, Reuti  wrote:
> > Hi,
> >
> > Am 12.03.2014 um 07:37 schrieb Victor:
> >
> > > I am using openmpi 1.7.4 on Ubuntu 12.04 x64 and I have a very odd
> problem.
> > >
> > > I have 4 nodes, all of which are defined in the hostfile and in
> /etc/hosts.
> > >
> > > I can log into each node using ssh and the certificate method from the
> > > shell that is running the mpi job, by using their name as defined in
> > > /etc/hosts.
> > >
> > > I can run an mpi job if I include only 3 nodes in the hostfile, for
> example:
> > >
> > > Node1 slots=8 max-slots=8
> > > Node2 slots=8 max-slots=8
> > > Node3 slots=8 max-slots=8
> >
> > You are using an uppercase name here intentionally - is this the name the
> > host returns from `hostname`? Although uppercase is allowed and should be
> > lowercased (or ignored) for hostname resolution, I have found that not all
> > programs do this. In my experience it is best to use only lowercase
> > characters.
> >
> > The same version of your Ubuntu Linux is installed on all machines?
> >
> > -- Reuti
> >
> >
> > > But if I add a fourth node into the hostfile eg:
> > >
> > > Node1 slots=8 max-slots=8
> > > Node2 slots=8 max-slots=8
> > > Node3 slots=8 max-slots=8
> > > Node4 slots=8 max-slots=8
> > >
> > > I get this error after attempting mpirun -np 32 --hostfile hostfile
> a.out:
> > >
> > > ssh: Could not resolve hostname Node4: Name or service not known.
> > >
> > > But, I can log into Node4 using ssh from the same shell by using ssh
> Node4.
> > >
> > > Also if I mix up the hostfile like this for example and place Node1 to
> the last spot:
> > >
> > > Node4 slots=8 max-slots=8
> > > Node2 slots=8 max-slots=8
> > > Node3 slots=8 max-slots=8
> > > Node1 slots=8 max-slots=8
> > >
> > > The error becomes
> > >
> > > ssh: Could not resolve hostname Node1: Name or service not known.
> > >
> > > If I then go back to the three node hostfile like this:
> > >
> > > Node1 slots=8 max-slots=8
> > > Node4 slots=8 max-slots=8
> > > Node2 slots=8 max-slots=8
> > >
> > > There is no error with three nodes even though both Node1 and Node4
> "cannot be found" if they are present in a 4 node hostfile in the last
> spot. The last slot seems to be bugged.
> > >
> > > What is going on? How do I fix this?
>
>
>
>


Re: [OMPI users] trying to use personal copy of 1.7.4

2014-03-12 Thread Reuti
Am 12.03.2014 um 11:39 schrieb Jeff Squyres (jsquyres):

> Generally, all you need to ensure that your personal copy of OMPI is used is 
> to set the PATH and LD_LIBRARY_PATH to point to your new Open MPI 
> installation.  I do this all the time on my development cluster (where I have 
> something like 6 billion different installations of OMPI available... mmm... 
> should probably clean that up...)
> 
> export LD_LIBRARY_PATH=path_to_my_ompi/lib:$LD_LIBRARY_PATH
> export PATH=path-to-my-ompi/bin:$PATH
> 
> It should be noted that:
> 
> 1. you need to *prefix* your PATH and LD_LIBRARY_PATH with these values
> 2. you need to set these values in a way that will be picked up on all 
> servers that you use in your job.  The safest way to do this is in your shell 
> startup files (e.g., $HOME/.bashrc or whatever is relevant for your shell).

I see "libtorque" in the output below - were these jobs running inside a 
queuing system? The set paths might be different therein, and need to be set in 
the job script in this case.

-- Reuti


> See http://www.open-mpi.org/faq/?category=running#run-prereqs, 
> http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path, and 
> http://www.open-mpi.org/faq/?category=running#mpirun-prefix.
> 
> Note the --prefix option that is described in the 3rd FAQ item I cited -- 
> that can be a bit easier, too.
> 
> 
> 
> On Mar 12, 2014, at 2:51 AM, Ross Boylan  wrote:
> 
>> I took the advice here and built a personal copy of the current openmpi,
>> to see if the problems I was having with Rmpi were a result of the old
>> version on the system.
>> 
>> When I do ldd on the relevant libraries (Rmpi.so is loaded dynamically
>> by R) everything looks fine; path references that should be local are.
>> But when I run the program and do lsof it shows that both the system and
>> personal versions of key libraries are opened.
>> 
>> First, does anyone know which library will actually be used, or how to
>> tell which one is actually used, in this situation? I'm running on
>> Linux (Debian squeeze).
>> 
>> Second, is there some way to prevent the wrong/old/system libraries from
>> being loaded?
>> 
>> FWIW I'm still seeing the old misbehavior when I run this way, but, as I
>> said, I'm really not sure which libraries are being used.  Since Rmpi
>> was built against the new/local ones, I think the fact that it doesn't
>> crash means I really am using the new ones.
>> 
>> Here are highlights of lsof on the process running R:
>> COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
>> R   17634 ross  cwdDIR  254,212288 150773764 
>> /home/ross/KHC/sunbelt
>> R   17634 ross  rtdDIR8,1 4096 2 /
>> R   17634 ross  txtREG8,1 5648   3058294 
>> /usr/lib/R/bin/exec/R
>> R   17634 ross  DELREG8,12416718 
>> /tmp/openmpi-sessions-ross@n100_0/60429/1/shared_mem_pool.n100
>> R   17634 ross  memREG8,1   335240   3105336 
>> /usr/lib/openmpi/lib/libopen-pal.so.0.0.0
>> R   17634 ross  memREG8,1   304576   3105337 
>> /usr/lib/openmpi/lib/libopen-rte.so.0.0.0
>> R   17634 ross  memREG8,1   679992   3105332 
>> /usr/lib/openmpi/lib/libmpi.so.0.0.2
>> R   17634 ross  memREG8,193936   2967826 
>> /usr/lib/libz.so.1.2.3.4
>> R   17634 ross  memREG8,110648   3187256 
>> /lib/libutil-2.11.3.so
>> R   17634 ross  memREG8,132320   2359631 
>> /usr/lib/libpciaccess.so.0.10.8
>> R   17634 ross  memREG8,133368   2359338 
>> /usr/lib/libnuma.so.1
>> R   17634 ross  memREG  254,2   979113 152045740 
>> /home/ross/install/lib/libopen-pal.so.6.1.0
>> R   17634 ross  memREG8,1   183456   2359592 
>> /usr/lib/libtorque.so.2.0.0
>> R   17634 ross  memREG  254,2  1058125 152045781 
>> /home/ross/install/lib/libopen-rte.so.7.0.0
>> R   17634 ross  memREG8,149936   2359341 
>> /usr/lib/libibverbs.so.1.0.0
>> R   17634 ross  memREG  254,2  2802579 152045867 
>> /home/ross/install/lib/libmpi.so.1.3.0
>> R   17634 ross  memREG  254,2   106626 152046481 
>> /home/ross/Rlib-3.0.1/Rmpi/libs/Rmpi.so
>> 
>> So libmpi, libopen-pal, and libopen-rte all are opened in two versions and 
>> two locations.
>> 
>> Thanks.
>> Ross Boylan
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> users mailing list
> us...@open-mpi.org
> 

Re: [OMPI users] trying to use personal copy of 1.7.4

2014-03-12 Thread Jeff Squyres (jsquyres)
Generally, all you need to ensure that your personal copy of OMPI is used is to 
set the PATH and LD_LIBRARY_PATH to point to your new Open MPI installation.  I 
do this all the time on my development cluster (where I have something like 6 
billion different installations of OMPI available... mmm... should probably 
clean that up...)

export LD_LIBRARY_PATH=path_to_my_ompi/lib:$LD_LIBRARY_PATH
export PATH=path-to-my-ompi/bin:$PATH

It should be noted that:

1. you need to *prefix* your PATH and LD_LIBRARY_PATH with these values
2. you need to set these values in a way that will be picked up on all servers 
that you use in your job.  The safest way to do this is in your shell startup 
files (e.g., $HOME/.bashrc or whatever is relevant for your shell).

See http://www.open-mpi.org/faq/?category=running#run-prereqs, 
http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path, and 
http://www.open-mpi.org/faq/?category=running#mpirun-prefix.

Note the --prefix option that is described in the 3rd FAQ item I cited -- that 
can be a bit easier, too.
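The recipe above can be made concrete as a shell sketch. This is hedged:
"$HOME/install" is a placeholder for whatever --prefix your personal Open MPI
build was configured with.

```shell
# Placeholder prefix: substitute the --prefix of your personal build.
export PATH="$HOME/install/bin:$PATH"
export LD_LIBRARY_PATH="$HOME/install/lib:$LD_LIBRARY_PATH"

# Sanity check: the personal copy should now shadow any system copy.
command -v mpirun || echo "mpirun not on PATH yet"

# Equivalent alternative: the --prefix option makes mpirun set PATH and
# LD_LIBRARY_PATH on the remote nodes itself (same effect as invoking
# mpirun by its absolute path):
# mpirun --prefix "$HOME/install" -np 4 ./a.out
```

Remember the exports must also take effect in non-interactive remote shells,
which is why putting them in $HOME/.bashrc (or the equivalent for your shell)
is the safest route.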



On Mar 12, 2014, at 2:51 AM, Ross Boylan  wrote:

> I took the advice here and built a personal copy of the current openmpi,
> to see if the problems I was having with Rmpi were a result of the old
> version on the system.
> 
> When I do ldd on the relevant libraries (Rmpi.so is loaded dynamically
> by R) everything looks fine; path references that should be local are.
> But when I run the program and do lsof it shows that both the system and
> personal versions of key libraries are opened.
> 
> First, does anyone know which library will actually be used, or how to
> tell which library is actually used, in this situation? I'm running on
> Linux (Debian squeeze).
> 
> Second, is there some way to prevent the wrong/old/system libraries from
> being loaded?
> 
> FWIW I'm still seeing the old misbehavior when I run this way, but, as I
> said, I'm really not sure which libraries are being used.  Since Rmpi
> was built against the new/local ones, I think the fact that it doesn't
> crash means I really am using the new ones.
> 
> Here are highlights of lsof on the process running R:
> COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
> R   17634 ross  cwdDIR  254,212288 150773764 
> /home/ross/KHC/sunbelt
> R   17634 ross  rtdDIR8,1 4096 2 /
> R   17634 ross  txtREG8,1 5648   3058294 
> /usr/lib/R/bin/exec/R
> R   17634 ross  DELREG8,12416718 
> /tmp/openmpi-sessions-ross@n100_0/60429/1/shared_mem_pool.n100
> R   17634 ross  memREG8,1   335240   3105336 
> /usr/lib/openmpi/lib/libopen-pal.so.0.0.0
> R   17634 ross  memREG8,1   304576   3105337 
> /usr/lib/openmpi/lib/libopen-rte.so.0.0.0
> R   17634 ross  memREG8,1   679992   3105332 
> /usr/lib/openmpi/lib/libmpi.so.0.0.2
> R   17634 ross  memREG8,193936   2967826 
> /usr/lib/libz.so.1.2.3.4
> R   17634 ross  memREG8,110648   3187256 
> /lib/libutil-2.11.3.so
> R   17634 ross  memREG8,132320   2359631 
> /usr/lib/libpciaccess.so.0.10.8
> R   17634 ross  memREG8,133368   2359338 
> /usr/lib/libnuma.so.1
> R   17634 ross  memREG  254,2   979113 152045740 
> /home/ross/install/lib/libopen-pal.so.6.1.0
> R   17634 ross  memREG8,1   183456   2359592 
> /usr/lib/libtorque.so.2.0.0
> R   17634 ross  memREG  254,2  1058125 152045781 
> /home/ross/install/lib/libopen-rte.so.7.0.0
> R   17634 ross  memREG8,149936   2359341 
> /usr/lib/libibverbs.so.1.0.0
> R   17634 ross  memREG  254,2  2802579 152045867 
> /home/ross/install/lib/libmpi.so.1.3.0
> R   17634 ross  memREG  254,2   106626 152046481 
> /home/ross/Rlib-3.0.1/Rmpi/libs/Rmpi.so
> 
> So libmpi, libopen-pal, and libopen-rte all are opened in two versions and 
> two locations.
> 
> Thanks.
> Ross Boylan
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] ctrl+c to abort a job with openmpi-1.7.5rc2

2014-03-12 Thread tmishima


Thanks, Jeff. I really understood the situation.

Tetsuya

> This all seems to be a side-effect of r30942 -- see:
>
> https://svn.open-mpi.org/trac/ompi/ticket/4365
>
>
> On Mar 12, 2014, at 5:13 AM,  wrote:
>
> >
> >
> > Hi Ralph,
> >
> > I installed openmpi-1.7.5rc2 and applied r31019 to it.
> > As far as I confirmed, rmaps framework worked fine.
> >
> > However, by chance, I noticed that single ctrl+c typing could
> > not terminate a running job. Twice typing was necessary.
> > Is this your expected behavior?
> >
> > I didn't use ctrl+c to abort for a while, I don't know when
> > it started. At least I can terminate the job by single ctrl+c
> > if I use openmpi-1.7.4.
> >
> > And, for your information, when I hit ctrl+c with more than 5
> > seconds interval, I get the message below:
> > Abort is in progress...hit ctrl-c again within 5 seconds to forcibly
> > terminate
> >
> > Tetsuya
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] ctrl+c to abort a job with openmpi-1.7.5rc2

2014-03-12 Thread Jeff Squyres (jsquyres)
This all seems to be a side-effect of r30942 -- see:

https://svn.open-mpi.org/trac/ompi/ticket/4365


On Mar 12, 2014, at 5:13 AM,  wrote:

> 
> 
> Hi Ralph,
> 
> I installed openmpi-1.7.5rc2 and applied r31019 to it.
> As far as I confirmed, rmaps framework worked fine.
> 
> However, by chance, I noticed that single ctrl+c typing could
> not terminate a running job. Twice typing was necessary.
> Is this your expected behavior?
> 
> I didn't use ctrl+c to abort for a while, I don't know when
> it started. At least I can terminate the job by single ctrl+c
> if I use openmpi-1.7.4.
> 
> And, for your information, when I hit ctrl+c with more than 5
> seconds interval, I get the message below:
> Abort is in progress...hit ctrl-c again within 5 seconds to forcibly
> terminate
> 
> Tetsuya
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Cannot run a job with more than 3 nodes

2014-03-12 Thread Jeff Squyres (jsquyres)
Are all names resolvable from all servers?

I.e., if you "ssh Node4" from Node1, Node2, and Node3, does it work?


On Mar 12, 2014, at 4:07 AM, Victor  wrote:

> Hostname no I use lower case, but for some reason while I was writing the 
> email I thought that upper case is clearer...
> 
> The same version of Ubuntu (12.04 x64) is on all nodes and openmpi and the 
> executable are shared via nfs.
> 
> 
> On 12 March 2014 16:01, Reuti  wrote:
> Hi,
> 
> Am 12.03.2014 um 07:37 schrieb Victor:
> 
> > I am using openmpi 1.7.4 on Ubuntu 12.04 x64 and I have a very odd problem.
> >
> > I have 4 nodes, all of which are defined in the hostfile and in /etc/hosts.
> >
> > I can log into each node using ssh and certificate method from the shell 
> > that is running the mpi job, by using their name as defined in /etc/hosts.
> >
> > I can run an mpi job if I include only 3 nodes in the hostfile, for example:
> >
> > Node1 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> > Node3 slots=8 max-slots=8
> 
> You are using an uppercase name here by intention - this is the one the host 
> returns by `hostname`? Although it is allowed and should be mangled to 
> lowercase resp. ignored for hostname resolution, I found that not all 
> programs are doing it. Best is to use only lowercase characters is my 
> experience.
> 
> The same version of your Ubuntu Linux is installed on all machines?
> 
> -- Reuti
> 
> 
> > But if I add a fourth node into the hostfile eg:
> >
> > Node1 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> > Node3 slots=8 max-slots=8
> > Node4 slots=8 max-slots=8
> >
> > I get this error after attempting mpirun -np 32 --hostfile hostfile a.out:
> >
> > ssh: Could not resolve hostname Node4: Name or service not known.
> >
> > But, I can log into Node4 using ssh from the same shell by using ssh Node4.
> >
> > Also if I mix up the hostfile like this for example and place Node1 to the 
> > last spot:
> >
> > Node4 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> > Node3 slots=8 max-slots=8
> > Node1 slots=8 max-slots=8
> >
> > The error becomes
> >
> > ssh: Could not resolve hostname Node1: Name or service not known.
> >
> > If I then go back to the three node hostfile like this:
> >
> > Node1 slots=8 max-slots=8
> > Node4 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> >
> > There is no error with three nodes even though both Node1 and Node4 "cannot 
> > be found" if they are present in a 4 node hostfile in the last spot. The 
> > last slot seems to be bugged.
> >
> > What is going on? How do I fix this?
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI users] ctrl+c to abort a job with openmpi-1.7.5rc2

2014-03-12 Thread tmishima


Hi Ralph,

I installed openmpi-1.7.5rc2 and applied r31019 to it.
As far as I confirmed, rmaps framework worked fine.

However, by chance, I noticed that a single ctrl+c could not
terminate a running job; typing it twice was necessary.
Is this the behavior you expect?

I hadn't used ctrl+c to abort for a while, so I don't know when
this started. At least I can terminate the job with a single ctrl+c
if I use openmpi-1.7.4.

And, for your information, when I hit ctrl+c with more than 5
seconds interval, I get the message below:
Abort is in progress...hit ctrl-c again within 5 seconds to forcibly
terminate

Tetsuya



Re: [OMPI users] Cannot run a job with more than 3 nodes

2014-03-12 Thread Victor
Hostname: no, I use lower case, but for some reason while I was writing
the email I thought that upper case would be clearer...

The same version of Ubuntu (12.04 x64) is on all nodes and openmpi and the
executable are shared via nfs.


On 12 March 2014 16:01, Reuti  wrote:

> Hi,
>
> Am 12.03.2014 um 07:37 schrieb Victor:
>
> > I am using openmpi 1.7.4 on Ubuntu 12.04 x64 and I have a very odd
> problem.
> >
> > I have 4 nodes, all of which are defined in the hostfile and in
> /etc/hosts.
> >
> > I can log into each node using ssh and certificate method from the shell
> that is running the mpi job, by using their name as defined in /etc/hosts.
> >
> > I can run an mpi job if I include only 3 nodes in the hostfile, for
> example:
> >
> > Node1 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> > Node3 slots=8 max-slots=8
>
> You are using an uppercase name here by intention - this is the one the
> host returns by `hostname`? Although it is allowed and should be mangled to
> lowercase resp. ignored for hostname resolution, I found that not all
> programs are doing it. Best is to use only lowercase characters is my
> experience.
>
> The same version of your Ubuntu Linux is installed on all machines?
>
> -- Reuti
>
>
> > But if I add a fourth node into the hostfile eg:
> >
> > Node1 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> > Node3 slots=8 max-slots=8
> > Node4 slots=8 max-slots=8
> >
> > I get this error after attempting mpirun -np 32 --hostfile hostfile
> a.out:
> >
> > ssh: Could not resolve hostname Node4: Name or service not known.
> >
> > But, I can log into Node4 using ssh from the same shell by using ssh
> Node4.
> >
> > Also if I mix up the hostfile like this for example and place Node1 to
> the last spot:
> >
> > Node4 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> > Node3 slots=8 max-slots=8
> > Node1 slots=8 max-slots=8
> >
> > The error becomes
> >
> > ssh: Could not resolve hostname Node1: Name or service not known.
> >
> > If I then go back to the three node hostfile like this:
> >
> > Node1 slots=8 max-slots=8
> > Node4 slots=8 max-slots=8
> > Node2 slots=8 max-slots=8
> >
> > There is no error with three nodes even though both Node1 and Node4
> "cannot be found" if they are present in a 4 node hostfile in the last
> spot. The last slot seems to be bugged.
> >
> > What is going on? How do I fix this?
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Cannot run a job with more than 3 nodes

2014-03-12 Thread Reuti
Hi,

Am 12.03.2014 um 07:37 schrieb Victor:

> I am using openmpi 1.7.4 on Ubuntu 12.04 x64 and I have a very odd problem.
> 
> I have 4 nodes, all of which are defined in the hostfile and in /etc/hosts.
> 
> I can log into each node using ssh and certificate method from the shell that 
> is running the mpi job, by using their name as defined in /etc/hosts.
> 
> I can run an mpi job if I include only 3 nodes in the hostfile, for example:
> 
> Node1 slots=8 max-slots=8
> Node2 slots=8 max-slots=8
> Node3 slots=8 max-slots=8

You are using an uppercase name here intentionally - is this the name the host 
returns from `hostname`? Although uppercase is allowed and should be folded to 
lowercase (or ignored) for hostname resolution, I have found that not all 
programs do this. In my experience it is best to use only lowercase characters.

The same version of your Ubuntu Linux is installed on all machines?

-- Reuti


> But if I add a fourth node into the hostfile eg:
> 
> Node1 slots=8 max-slots=8
> Node2 slots=8 max-slots=8
> Node3 slots=8 max-slots=8
> Node4 slots=8 max-slots=8
> 
> I get this error after attempting mpirun -np 32 --hostfile hostfile a.out:
> 
> ssh: Could not resolve hostname Node4: Name or service not known.
> 
> But, I can log into Node4 using ssh from the same shell by using ssh Node4.
> 
> Also if I mix up the hostfile like this for example and place Node1 to the 
> last spot:
> 
> Node4 slots=8 max-slots=8
> Node2 slots=8 max-slots=8
> Node3 slots=8 max-slots=8
> Node1 slots=8 max-slots=8
> 
> The error becomes 
> 
> ssh: Could not resolve hostname Node1: Name or service not known.
> 
> If I then go back to the three node hostfile like this:
> 
> Node1 slots=8 max-slots=8
> Node4 slots=8 max-slots=8
> Node2 slots=8 max-slots=8
> 
> There is no error with three nodes even though both Node1 and Node4 "cannot 
> be found" if they are present in a 4 node hostfile in the last spot. The last 
> slot seems to be bugged.
> 
> What is going on? How do I fix this?
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Cannot run a job with more than 3 nodes

2014-03-12 Thread Victor
I "fixed it" by finding the message regarding tree spawn in a thread from
November 2013. When I run the job with -mca plm_rsh_no_tree_spawn 1 the job
works over 4 nodes.

I cannot identify any errors in ssh key setup and since I am only using 4
nodes I am not concerned about somewhat slower launch speed. Is faster job
launch speed the only benefit of tree spawn?
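Since tree spawn has intermediate nodes launch the next hop over ssh
themselves, every node (not just the one running mpirun) must be able to
resolve every other node's name. A hedged sketch of that check, where
"node1".."node4" are placeholders for the names in the hostfile:

```shell
# Check resolution on the machine mpirun runs from:
getent hosts node4 >/dev/null || echo "node4 does not resolve here"

# With tree spawn, the same check must pass on every node, since an
# intermediate orted sshes to the next node itself. Uncomment with
# real hostnames:
# for n in node1 node2 node3; do
#   ssh "$n" 'getent hosts node4 >/dev/null' \
#     || echo "node4 does not resolve from $n"
# done
```

If the per-node check fails, adding the missing entries to each node's
/etc/hosts (or fixing DNS) should let tree spawn work without the
plm_rsh_no_tree_spawn workaround.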


[OMPI users] trying to use personal copy of 1.7.4

2014-03-12 Thread Ross Boylan
I took the advice here and built a personal copy of the current openmpi,
to see if the problems I was having with Rmpi were a result of the old
version on the system.

When I do ldd on the relevant libraries (Rmpi.so is loaded dynamically
by R) everything looks fine; path references that should be local are.
But when I run the program and do lsof it shows that both the system and
personal versions of key libraries are opened.

First, does anyone know which library will actually be used, or how to
tell which library is actually used, in this situation? I'm running on
Linux (Debian squeeze).

Second, is there some way to prevent the wrong/old/system libraries from
being loaded?

FWIW I'm still seeing the old misbehavior when I run this way, but, as I
said, I'm really not sure which libraries are being used.  Since Rmpi
was built against the new/local ones, I think the fact that it doesn't
crash means I really am using the new ones.

Here are highlights of lsof on the process running R:
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
R   17634 ross  cwdDIR  254,212288 150773764 
/home/ross/KHC/sunbelt
R   17634 ross  rtdDIR8,1 4096 2 /
R   17634 ross  txtREG8,1 5648   3058294 
/usr/lib/R/bin/exec/R
R   17634 ross  DELREG8,12416718 
/tmp/openmpi-sessions-ross@n100_0/60429/1/shared_mem_pool.n100
R   17634 ross  memREG8,1   335240   3105336 
/usr/lib/openmpi/lib/libopen-pal.so.0.0.0
R   17634 ross  memREG8,1   304576   3105337 
/usr/lib/openmpi/lib/libopen-rte.so.0.0.0
R   17634 ross  memREG8,1   679992   3105332 
/usr/lib/openmpi/lib/libmpi.so.0.0.2
R   17634 ross  memREG8,193936   2967826 
/usr/lib/libz.so.1.2.3.4
R   17634 ross  memREG8,110648   3187256 
/lib/libutil-2.11.3.so
R   17634 ross  memREG8,132320   2359631 
/usr/lib/libpciaccess.so.0.10.8
R   17634 ross  memREG8,133368   2359338 
/usr/lib/libnuma.so.1
R   17634 ross  memREG  254,2   979113 152045740 
/home/ross/install/lib/libopen-pal.so.6.1.0
R   17634 ross  memREG8,1   183456   2359592 
/usr/lib/libtorque.so.2.0.0
R   17634 ross  memREG  254,2  1058125 152045781 
/home/ross/install/lib/libopen-rte.so.7.0.0
R   17634 ross  memREG8,149936   2359341 
/usr/lib/libibverbs.so.1.0.0
R   17634 ross  memREG  254,2  2802579 152045867 
/home/ross/install/lib/libmpi.so.1.3.0
R   17634 ross  memREG  254,2   106626 152046481 
/home/ross/Rlib-3.0.1/Rmpi/libs/Rmpi.so

So libmpi, libopen-pal, and libopen-rte all are opened in two versions and two 
locations.
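One hedged way to answer the "which copy is actually used" question for a
running process is to read /proc/<pid>/maps, which lists the files the dynamic
loader actually mapped; symbols usually bind to the first matching copy mapped.
The PID below is a placeholder for the R process (17634 in the lsof output
above):

```shell
# List the distinct file paths backing MPI-related mappings of a process.
pid=$$   # placeholder: substitute the real PID of the R process
awk '$NF ~ /lib(mpi|open-pal|open-rte)/ {print $NF}' "/proc/$pid/maps" | sort -u

# For a fresh run, the loader can report its search decisions directly:
# LD_DEBUG=libs R --vanilla 2>&1 | grep libmpi
```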

Thanks.
Ross Boylan



[OMPI users] Cannot run a job with more than 3 nodes

2014-03-12 Thread Victor
I am using openmpi 1.7.4 on Ubuntu 12.04 x64 and I have a very odd problem.

I have 4 nodes, all of which are defined in the hostfile and in /etc/hosts.

I can log into each node using ssh with certificate authentication, from the
shell that is running the mpi job, by using their name as defined in /etc/hosts.

I can run an mpi job if I include only 3 nodes in the hostfile, for example:

Node1 slots=8 max-slots=8
Node2 slots=8 max-slots=8
Node3 slots=8 max-slots=8

But if I add a fourth node into the hostfile eg:

Node1 slots=8 max-slots=8
Node2 slots=8 max-slots=8
Node3 slots=8 max-slots=8
Node4 slots=8 max-slots=8

I get this error after attempting mpirun -np 32 --hostfile hostfile a.out:

ssh: Could not resolve hostname Node4: Name or service not known.

But, I can log into Node4 using ssh from the same shell by using ssh Node4.

Also if I mix up the hostfile, for example placing Node1 in the
last spot:

Node4 slots=8 max-slots=8
Node2 slots=8 max-slots=8
Node3 slots=8 max-slots=8
Node1 slots=8 max-slots=8

The error becomes

ssh: Could not resolve hostname Node1: Name or service not known.

If I then go back to the three node hostfile like this:

Node1 slots=8 max-slots=8
Node4 slots=8 max-slots=8
Node2 slots=8 max-slots=8

There is no error with three nodes even though both Node1 and Node4 "cannot
be found" if they are present in a 4 node hostfile in the last spot. The
last slot seems to be bugged.

What is going on? How do I fix this?