Re: [OMPI users] trying to use personal copy of 1.7.4
On Wed, 2014-03-12 at 14:34 +, Dave Goodell (dgoodell) wrote:
> Perhaps there's an RPATH issue here? I don't fully understand the structure
> of Rmpi, but is there both an app and a library (or two separate libraries)
> that are linking against MPI?
>
> I.e., what we want is:
>
>     app -> ~ross/OMPI
>      \           /
>       --> library --
>
> But what we're getting is:
>
>     app ---> /usr/OMPI
>      \
>       --> library ---> ~ross/OMPI
>
> If one of them was first linked against the /usr/OMPI and managed to get an
> RPATH then it could override your LD_LIBRARY_PATH.

I think the relevant app here is R. It was built without any awareness of
MPI, I'm pretty sure. R loads the library Rmpi.so, which in turn references
MPI.

The R binary has no runpath or rpath according to chrpath. ldd Rmpi.so shows
my local MPI libraries and not the system ones, though it references plenty
of other system libraries.

The system MPI libraries are in a standard place, /usr/lib
(/usr/lib/openmpi/lib/ more precisely), and so I don't think an rpath is
necessary to look for it.

Ross

> -Dave
>
> On Mar 12, 2014, at 5:39 AM, Jeff Squyres (jsquyres) wrote:
>
> > Generally, all you need to ensure that your personal copy of OMPI is used
> > is to set the PATH and LD_LIBRARY_PATH to point to your new Open MPI
> > installation. I do this all the time on my development cluster (where I
> > have something like 6 billion different installations of OMPI available...
> > mmm... should probably clean that up...)
> >
> > export LD_LIBRARY_PATH=path_to_my_ompi/lib:$LD_LIBRARY_PATH
> > export PATH=path-to-my-ompi/bin:$PATH
> >
> > It should be noted that:
> >
> > 1. you need to *prefix* your PATH and LD_LIBRARY_PATH with these values
> > 2. you need to set these values in a way that will be picked up on all
> > servers that you use in your job. The safest way to do this is in your
> > shell startup files (e.g., $HOME/.bashrc or whatever is relevant for your
> > shell).
> > See http://www.open-mpi.org/faq/?category=running#run-prereqs,
> > http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path, and
> > http://www.open-mpi.org/faq/?category=running#mpirun-prefix.
> >
> > Note the --prefix option that is described in the 3rd FAQ item I cited --
> > that can be a bit easier, too.
> >
> > On Mar 12, 2014, at 2:51 AM, Ross Boylan wrote:
> >
> >> I took the advice here and built a personal copy of the current openmpi,
> >> to see if the problems I was having with Rmpi were a result of the old
> >> version on the system.
> >>
> >> When I do ldd on the relevant libraries (Rmpi.so is loaded dynamically
> >> by R) everything looks fine; path references that should be local are.
> >> But when I run the program and do lsof it shows that both the system and
> >> personal versions of key libraries are opened.
> >>
> >> First, does anyone know which library will actually be used, or how to
> >> tell which library is actually used, in this situation? I'm running on
> >> linux (Debian squeeze).
> >>
> >> Second, is there some way to prevent the wrong/old/system libraries from
> >> being loaded?
> >>
> >> FWIW I'm still seeing the old misbehavior when I run this way, but, as I
> >> said, I'm really not sure which libraries are being used. Since Rmpi
> >> was built against the new/local ones, I think the fact that it doesn't
> >> crash means I really am using the new ones.
> >>
> >> Here are highlights of lsof on the process running R:
> >> COMMAND   PID USER  FD   TYPE DEVICE SIZE/OFF      NODE NAME
> >> R       17634 ross  cwd   DIR  254,2    12288 150773764 /home/ross/KHC/sunbelt
> >> R       17634 ross  rtd   DIR    8,1     4096         2 /
> >> R       17634 ross  txt   REG    8,1     5648   3058294 /usr/lib/R/bin/exec/R
> >> R       17634 ross  DEL   REG    8,1            2416718 /tmp/openmpi-sessions-ross@n100_0/60429/1/shared_mem_pool.n100
> >> R       17634 ross  mem   REG    8,1   335240   3105336 /usr/lib/openmpi/lib/libopen-pal.so.0.0.0
> >> R       17634 ross  mem   REG    8,1   304576   3105337 /usr/lib/openmpi/lib/libopen-rte.so.0.0.0
> >> R       17634 ross  mem   REG    8,1   679992   3105332 /usr/lib/openmpi/lib/libmpi.so.0.0.2
> >> R       17634 ross  mem   REG    8,1    93936   2967826 /usr/lib/libz.so.1.2.3.4
> >> R       17634 ross  mem   REG    8,1    10648   3187256 /lib/libutil-2.11.3.so
> >> R       17634 ross  mem   REG    8,1    32320   2359631 /usr/lib/libpciaccess.so.0.10.8
> >> R       17634 ross  mem   REG    8,1    33368   2359338 /usr/lib/libnuma.so.1
> >> R       17634 ross  mem   REG  254,2   979113 152045740 /home/ross/install/lib/libopen-pal.so.6.1.0
> >> R
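Since lsof shows both copies of the MPI libraries mapped into the R process, it can help to script that inspection directly from /proc. A minimal sketch (Linux-only; it inspects the current shell so it is self-contained, and the grep pattern is my own — substitute the R process's PID, 17634 in the thread, to check the real case):

```shell
# List the shared objects mapped into a process via /proc/<pid>/maps.
# $$ is this shell; replace with the target PID to inspect another process.
pid=$$
libs=$(grep -o '/[^ ]*\.so[^ ]*' "/proc/$pid/maps" | sort -u)
echo "$libs"
```

Run against the R PID, this would show the same mix of /usr/lib/openmpi and /home/ross/install entries that lsof reported, without the extra columns.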
Re: [OMPI users] trying to use personal copy of 1.7.4
I remember having a conversation with someone from R at Supercomputing last year, and this was one of the issues we discussed. The problem is that you have to ensure that R is built against the OMPI you are going to use, and it is usually better to have configured OMPI with --disable-dlopen --enable-static to avoid library confusion when you later run R. I'd give that a try and see if it solves your problems. The "recipe" given by Bennet looked right to me.

On Mar 12, 2014, at 12:32 PM, Ross Boylan wrote:

> On Wed, 2014-03-12 at 11:50 +0100, Reuti wrote:
>> On 12.03.2014 at 11:39, Jeff Squyres (jsquyres) wrote:
>>
>>> Generally, all you need to ensure that your personal copy of OMPI is used
>>> is to set the PATH and LD_LIBRARY_PATH to point to your new Open MPI
>>> installation. I do this all the time on my development cluster (where I
>>> have something like 6 billion different installations of OMPI available...
>>> mmm... should probably clean that up...)
>>>
>>> export LD_LIBRARY_PATH=path_to_my_ompi/lib:$LD_LIBRARY_PATH
>>> export PATH=path-to-my-ompi/bin:$PATH
>
> I believe I've already done that. The script that launches everything is
> (all one line originally)
>
> R_PROFILE_USER=~/KHC/sunbelt/Rmpiprofile \
> LD_LIBRARY_PATH=/home/ross/install/lib:$LD_LIBRARY_PATH \
> PATH=/home/ross/install/bin:$PATH orterun -x R_PROFILE_USER \
> -x LD_LIBRARY_PATH -x PATH -hostfile ~/KHC/sunbelt/hosts \
> -np 7 R --no-save -q
>
> There is a complication with R; it sticks stuff in front of
> LD_LIBRARY_PATH. However, the startup script Rmpiprofile fixes that,
> though I'm not entirely sure that is effective. In any case, the old
> libraries that are being loaded are not from any directories R added to
> LD_LIBRARY_PATH; instead they are from /usr/lib, which is a standard
> place for the dynamic loader to look.
>
>>> It should be noted that:
>>>
>>> 1. you need to *prefix* your PATH and LD_LIBRARY_PATH with these values
>>> 2.
you need to set these values in a way that will be picked up on all
>>> servers that you use in your job. The safest way to do this is in your
>>> shell startup files (e.g., $HOME/.bashrc or whatever is relevant for your
>>> shell).
>>
>> I see "libtorque" in the output below - were these jobs running inside a
>> queuing system? The set paths might be different therein, and need to be
>> set in the job script in this case.
>
> No batch system (see script above for launch mechanism). We threw a lot
> of stuff MPI configure was looking for onto the system. AFAIK torque
> isn't even installed.
>
> One possible issue is that the Rmpi module for R is not compiled by
> mpicc; R has its own notions of proper options for the compiler and its
> own infrastructure for building things. I did pass the location of my
> local libraries into the build process.
>
> This seems more like an issue with the dynamic loader, or with whatever
> system R is using when it loads Rmpi.so.
>
> Ross
>
>> -- Reuti
>>
>>> See http://www.open-mpi.org/faq/?category=running#run-prereqs,
>>> http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path, and
>>> http://www.open-mpi.org/faq/?category=running#mpirun-prefix.
>>>
>>> Note the --prefix option that is described in the 3rd FAQ item I cited --
>>> that can be a bit easier, too.
>>>
>>> On Mar 12, 2014, at 2:51 AM, Ross Boylan wrote:

I took the advice here and built a personal copy of the current openmpi, to see if the problems I was having with Rmpi were a result of the old version on the system.

When I do ldd on the relevant libraries (Rmpi.so is loaded dynamically by R) everything looks fine; path references that should be local are. But when I run the program and do lsof it shows that both the system and personal versions of key libraries are opened.

First, does anyone know which library will actually be used, or how to tell which library is actually used, in this situation? I'm running on linux (Debian squeeze).
Second, is there some way to prevent the wrong/old/system libraries from being loaded?

FWIW I'm still seeing the old misbehavior when I run this way, but, as I said, I'm really not sure which libraries are being used. Since Rmpi was built against the new/local ones, I think the fact that it doesn't crash means I really am using the new ones.

Here are highlights of lsof on the process running R:
COMMAND   PID USER  FD   TYPE DEVICE SIZE/OFF      NODE NAME
R       17634 ross  cwd   DIR  254,2    12288 150773764 /home/ross/KHC/sunbelt
R       17634 ross  rtd   DIR    8,1     4096         2 /
R       17634 ross  txt   REG    8,1     5648   3058294 /usr/lib/R/bin/exec/R
R       17634 ross  DEL   REG    8,1            2416718
Re: [OMPI users] trying to use personal copy of 1.7.4
On Wed, 2014-03-12 at 11:50 +0100, Reuti wrote:
> On 12.03.2014 at 11:39, Jeff Squyres (jsquyres) wrote:
>
> > Generally, all you need to ensure that your personal copy of OMPI is used
> > is to set the PATH and LD_LIBRARY_PATH to point to your new Open MPI
> > installation. I do this all the time on my development cluster (where I
> > have something like 6 billion different installations of OMPI available...
> > mmm... should probably clean that up...)
> >
> > export LD_LIBRARY_PATH=path_to_my_ompi/lib:$LD_LIBRARY_PATH
> > export PATH=path-to-my-ompi/bin:$PATH

I believe I've already done that. The script that launches everything is
(all one line originally)

R_PROFILE_USER=~/KHC/sunbelt/Rmpiprofile \
LD_LIBRARY_PATH=/home/ross/install/lib:$LD_LIBRARY_PATH \
PATH=/home/ross/install/bin:$PATH orterun -x R_PROFILE_USER \
-x LD_LIBRARY_PATH -x PATH -hostfile ~/KHC/sunbelt/hosts \
-np 7 R --no-save -q

There is a complication with R; it sticks stuff in front of
LD_LIBRARY_PATH. However, the startup script Rmpiprofile fixes that,
though I'm not entirely sure that is effective. In any case, the old
libraries that are being loaded are not from any directories R added to
LD_LIBRARY_PATH; instead they are from /usr/lib, which is a standard
place for the dynamic loader to look.

> > It should be noted that:
> >
> > 1. you need to *prefix* your PATH and LD_LIBRARY_PATH with these values
> > 2. you need to set these values in a way that will be picked up on all
> > servers that you use in your job. The safest way to do this is in your
> > shell startup files (e.g., $HOME/.bashrc or whatever is relevant for your
> > shell).
>
> I see "libtorque" in the output below - were these jobs running inside a
> queuing system? The set paths might be different therein, and need to be
> set in the job script in this case.

No batch system (see script above for launch mechanism). We threw a lot of stuff MPI configure was looking for onto the system. AFAIK torque isn't even installed.
One possible issue is that the Rmpi module for R is not compiled by mpicc; R has its own notions of proper options for the compiler and its own infrastructure for building things. I did pass the location of my local libraries into the build process.

This seems more like an issue with the dynamic loader, or with whatever system R is using when it loads Rmpi.so.

Ross

> -- Reuti
>
> > See http://www.open-mpi.org/faq/?category=running#run-prereqs,
> > http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path, and
> > http://www.open-mpi.org/faq/?category=running#mpirun-prefix.
> >
> > Note the --prefix option that is described in the 3rd FAQ item I cited --
> > that can be a bit easier, too.
> >
> > On Mar 12, 2014, at 2:51 AM, Ross Boylan wrote:
> >
> >> I took the advice here and built a personal copy of the current openmpi,
> >> to see if the problems I was having with Rmpi were a result of the old
> >> version on the system.
> >>
> >> When I do ldd on the relevant libraries (Rmpi.so is loaded dynamically
> >> by R) everything looks fine; path references that should be local are.
> >> But when I run the program and do lsof it shows that both the system and
> >> personal versions of key libraries are opened.
> >>
> >> First, does anyone know which library will actually be used, or how to
> >> tell which library is actually used, in this situation? I'm running on
> >> linux (Debian squeeze).
> >>
> >> Second, is there some way to prevent the wrong/old/system libraries from
> >> being loaded?
> >>
> >> FWIW I'm still seeing the old misbehavior when I run this way, but, as I
> >> said, I'm really not sure which libraries are being used. Since Rmpi
> >> was built against the new/local ones, I think the fact that it doesn't
> >> crash means I really am using the new ones.
> >>
> >> Here are highlights of lsof on the process running R:
> >> COMMAND   PID USER  FD   TYPE DEVICE SIZE/OFF      NODE NAME
> >> R       17634 ross  cwd   DIR  254,2    12288 150773764 /home/ross/KHC/sunbelt
> >> R       17634 ross  rtd   DIR    8,1     4096         2 /
> >> R       17634 ross  txt   REG    8,1     5648   3058294 /usr/lib/R/bin/exec/R
> >> R       17634 ross  DEL   REG    8,1            2416718 /tmp/openmpi-sessions-ross@n100_0/60429/1/shared_mem_pool.n100
> >> R       17634 ross  mem   REG    8,1   335240   3105336 /usr/lib/openmpi/lib/libopen-pal.so.0.0.0
> >> R       17634 ross  mem   REG    8,1   304576   3105337 /usr/lib/openmpi/lib/libopen-rte.so.0.0.0
> >> R       17634 ross  mem   REG    8,1   679992   3105332 /usr/lib/openmpi/lib/libmpi.so.0.0.2
> >> R       17634 ross  mem   REG    8,1    93936   2967826 /usr/lib/libz.so.1.2.3.4
> >> R       17634 ross  mem   REG    8,1    10648   3187256
> >>
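The fix Ross describes Rmpiprofile making can be sketched in plain sh: drop whatever R prepended to LD_LIBRARY_PATH and put the personal lib directory back in front. This is my own sketch, not the actual Rmpiprofile; the assumption that R's additions match /usr/lib/R is mine, and /home/ross/install/lib is the path from this thread:

```shell
# Rebuild LD_LIBRARY_PATH so the personal MPI comes first and directories
# assumed to have been prepended by R (matching /usr/lib/R) are removed.
mpi_lib=/home/ross/install/lib
old=${LD_LIBRARY_PATH:-}
clean=$(printf '%s\n' "$old" | tr ':' '\n' | grep -v '^/usr/lib/R' | paste -sd: -)
LD_LIBRARY_PATH="$mpi_lib${clean:+:$clean}"
export LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH"
```

Note this only controls the loader's search list; it cannot undo an embedded RPATH in a library that was already linked against the system copy.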
Re: [OMPI users] trying to use personal copy of 1.7.4
My experience with Rmpi and OpenMPI is that it doesn't seem to do well with dlopen or dynamic loading. I recently installed R 3.0.3, and Rmpi, which failed when built against our standard OpenMPI but succeeded using the following 'secret recipe'. Perhaps there is something here that will be helpful for you.

### Install openmpi 1.6.5

export PREFIX=/scratch/support_flux/bennet/local
COMPILERS='CC=gcc CXX=g++ FC=gfortran F77=gfortran'
CONFIGURE_FLAGS='--disable-dlopen --enable-static'
cd openmpi-1.6.5
./configure --prefix=${PREFIX} \
    --mandir=${PREFIX}/man \
    --with-tm=/usr/local/torque \
    --with-openib --with-psm \
    --with-io-romio-flags='--with-file-system=testfs+ufs+nfs+lustre' \
    $CONFIGURE_FLAGS \
    $COMPILERS
make
make check
make install

### Install R 3.0.3

wget http://cran.case.edu/src/base/R-3/R-3.0.3.tar.gz
tar xzvf R-3.0.3.tar.gz
cd R-3.0.3
export MPI_HOME=/scratch/support_flux/bennet/local
export LD_LIBRARY_PATH=$MPI_HOME/lib:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH=$MPI_HOME/openmpi:${LD_LIBRARY_PATH}
export PATH=${PATH}:${MPI_HOME}/bin
export LDFLAGS='-Wl,-O1'
export R_PAPERSIZE=letter
export R_INST=${PREFIX}
export FFLAGS='-O3 -mtune=native'
export CFLAGS='-O3 -mtune=native'
./configure --prefix=${R_INST} --mandir=${R_INST}/man --enable-R-shlib --without-x
make
make check
make install

wget http://www.stats.uwo.ca/faculty/yu/Rmpi/download/linux/Rmpi_0.6-3.tar.gz
R CMD INSTALL Rmpi_0.6-3.tar.gz \
    --configure-args="--with-Rmpi-include=$MPI_HOME/include \
    --with-Rmpi-libpath=$MPI_HOME/lib --with-Rmpi-type=OPENMPI"

### Make sure environment variables and paths are set

MPI_HOME=/home/software/rhel6/openmpi-1.6.5/gcc-4.4.7-static
PATH=/home/software/rhel6/openmpi-1.6.5/gcc-4.4.7-static/bin:${PATH}
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/home/software/rhel6/openmpi-1.6.5/gcc-4.4.7-static/lib
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/home/software/rhel6/openmpi-1.6.5/gcc-4.4.7-static/lib/openmpi
PATH=/home/software/rhel6/R/3.0.3/bin:${PATH}
LD_LIBRARY_PATH=/home/software/rhel6/R/3.0.3/lib64/R/lib:${LD_LIBRARY_PATH}

## Then install snow with R

> install.packages('snow')
[ . . . ]

I think the key thing is the --disable-dlopen, though it might require both. Jeff Squyres had a post about this quite a while ago that gives more detail about what's happening: http://www.open-mpi.org/community/lists/devel/2012/04/10840.php

-- bennet
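One detail worth double-checking in the environment block of the recipe: the shell and the dynamic loader both scan these colon-separated lists left to right, so where a new directory is spliced in decides which install wins. A quick, self-contained sanity check, using the recipe's paths (prefixing here rather than appending, per Jeff's advice elsewhere in this thread):

```shell
# The first entry in each list is searched first, so prefixing makes the
# static OpenMPI build take precedence over anything already on the system.
MPI_HOME=/home/software/rhel6/openmpi-1.6.5/gcc-4.4.7-static
PATH="$MPI_HOME/bin:$PATH"
LD_LIBRARY_PATH="$MPI_HOME/lib:${LD_LIBRARY_PATH:-}"
echo "first on PATH:            ${PATH%%:*}"
echo "first on LD_LIBRARY_PATH: ${LD_LIBRARY_PATH%%:*}"
```

If the personal directory is not the first match, `command -v mpirun` and `ldd` will quietly keep resolving to the system copies.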
Re: [OMPI users] trying to use personal copy of 1.7.4
Perhaps there's an RPATH issue here? I don't fully understand the structure of Rmpi, but is there both an app and a library (or two separate libraries) that are linking against MPI?

I.e., what we want is:

    app -> ~ross/OMPI
     \           /
      --> library --

But what we're getting is:

    app ---> /usr/OMPI
     \
      --> library ---> ~ross/OMPI

If one of them was first linked against the /usr/OMPI and managed to get an RPATH then it could override your LD_LIBRARY_PATH.

-Dave

On Mar 12, 2014, at 5:39 AM, Jeff Squyres (jsquyres) wrote:

> Generally, all you need to ensure that your personal copy of OMPI is used is
> to set the PATH and LD_LIBRARY_PATH to point to your new Open MPI
> installation. I do this all the time on my development cluster (where I have
> something like 6 billion different installations of OMPI available... mmm...
> should probably clean that up...)
>
> export LD_LIBRARY_PATH=path_to_my_ompi/lib:$LD_LIBRARY_PATH
> export PATH=path-to-my-ompi/bin:$PATH
>
> It should be noted that:
>
> 1. you need to *prefix* your PATH and LD_LIBRARY_PATH with these values
> 2. you need to set these values in a way that will be picked up on all
> servers that you use in your job. The safest way to do this is in your shell
> startup files (e.g., $HOME/.bashrc or whatever is relevant for your shell).
>
> See http://www.open-mpi.org/faq/?category=running#run-prereqs,
> http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path, and
> http://www.open-mpi.org/faq/?category=running#mpirun-prefix.
>
> Note the --prefix option that is described in the 3rd FAQ item I cited --
> that can be a bit easier, too.
>
> On Mar 12, 2014, at 2:51 AM, Ross Boylan wrote:
>
>> I took the advice here and built a personal copy of the current openmpi,
>> to see if the problems I was having with Rmpi were a result of the old
>> version on the system.
>>
>> When I do ldd on the relevant libraries (Rmpi.so is loaded dynamically
>> by R) everything looks fine; path references that should be local are.
>> But when I run the program and do lsof it shows that both the system and
>> personal versions of key libraries are opened.
>>
>> First, does anyone know which library will actually be used, or how to
>> tell which library is actually used, in this situation? I'm running on
>> linux (Debian squeeze).
>>
>> Second, is there some way to prevent the wrong/old/system libraries from
>> being loaded?
>>
>> FWIW I'm still seeing the old misbehavior when I run this way, but, as I
>> said, I'm really not sure which libraries are being used. Since Rmpi
>> was built against the new/local ones, I think the fact that it doesn't
>> crash means I really am using the new ones.
>>
>> Here are highlights of lsof on the process running R:
>> COMMAND   PID USER  FD   TYPE DEVICE SIZE/OFF      NODE NAME
>> R       17634 ross  cwd   DIR  254,2    12288 150773764 /home/ross/KHC/sunbelt
>> R       17634 ross  rtd   DIR    8,1     4096         2 /
>> R       17634 ross  txt   REG    8,1     5648   3058294 /usr/lib/R/bin/exec/R
>> R       17634 ross  DEL   REG    8,1            2416718 /tmp/openmpi-sessions-ross@n100_0/60429/1/shared_mem_pool.n100
>> R       17634 ross  mem   REG    8,1   335240   3105336 /usr/lib/openmpi/lib/libopen-pal.so.0.0.0
>> R       17634 ross  mem   REG    8,1   304576   3105337 /usr/lib/openmpi/lib/libopen-rte.so.0.0.0
>> R       17634 ross  mem   REG    8,1   679992   3105332 /usr/lib/openmpi/lib/libmpi.so.0.0.2
>> R       17634 ross  mem   REG    8,1    93936   2967826 /usr/lib/libz.so.1.2.3.4
>> R       17634 ross  mem   REG    8,1    10648   3187256 /lib/libutil-2.11.3.so
>> R       17634 ross  mem   REG    8,1    32320   2359631 /usr/lib/libpciaccess.so.0.10.8
>> R       17634 ross  mem   REG    8,1    33368   2359338 /usr/lib/libnuma.so.1
>> R       17634 ross  mem   REG  254,2   979113 152045740 /home/ross/install/lib/libopen-pal.so.6.1.0
>> R       17634 ross  mem   REG    8,1   183456   2359592 /usr/lib/libtorque.so.2.0.0
>> R       17634 ross  mem   REG  254,2  1058125 152045781 /home/ross/install/lib/libopen-rte.so.7.0.0
>> R       17634 ross  mem   REG    8,1    49936   2359341 /usr/lib/libibverbs.so.1.0.0
>> R       17634 ross  mem   REG  254,2  2802579 152045867 /home/ross/install/lib/libmpi.so.1.3.0
>> R       17634 ross  mem   REG  254,2   106626 152046481 /home/ross/Rlib-3.0.1/Rmpi/libs/Rmpi.so
>>
>> So libmpi, libopen-pal, and libopen-rte all are opened in two versions and
>> two locations.
>>
>> Thanks.
>> Ross Boylan
>>
>> ___
>> users mailing list
>>
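Dave's RPATH hypothesis is easy to test directly on each object involved. A sketch (readelf comes from binutils; the Rmpi.so path is the one from the lsof output above, so on another system substitute your own; chrpath, which Ross used on the R binary, reports the same information):

```shell
# Print any RPATH/RUNPATH entry embedded in a shared object. DT_RPATH is
# consulted before LD_LIBRARY_PATH, so a stale one can silently win.
so=/home/ross/Rlib-3.0.1/Rmpi/libs/Rmpi.so
rpath=$(readelf -d "$so" 2>/dev/null | grep -E 'RPATH|RUNPATH' || true)
if [ -n "$rpath" ]; then
  echo "embedded search path: $rpath"
else
  echo "no RPATH/RUNPATH found (or file/readelf unavailable)"
fi
```

Note the asymmetry: DT_RPATH overrides LD_LIBRARY_PATH, but DT_RUNPATH does not, so which tag is present matters as much as its value.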
Re: [OMPI users] Cannot run a job with more than 3 nodes
Can you verify that for all 4 nodes? I.e., something like this:

foreach node (Node1 Node2 Node3 Node4)
  foreach other (Node1 Node2 Node3 Node4)
    echo from $node to $other
    ssh $node ssh $other hostname
  end
end

On Mar 12, 2014, at 7:34 AM, Victor wrote:

> Yes they are. Can resolve and log into each node, from each node, using
> their "friendly" name, not IP.
>
> On 12 March 2014 18:15, Jeff Squyres (jsquyres) wrote:
> Are all names resolvable from all servers?
>
> I.e., if you "ssh Node4" from Node1, Node2, and Node3, does it work?
>
> On Mar 12, 2014, at 4:07 AM, Victor wrote:
>
> > Hostname no I use lower case, but for some reason while I was writing
> > the email I thought that upper case is clearer...
> >
> > The same version of Ubuntu (12.04 x64) is on all nodes and openmpi and
> > the executable are shared via nfs.
> >
> > On 12 March 2014 16:01, Reuti wrote:
> > Hi,
> >
> > On 12.03.2014 at 07:37, Victor wrote:
> >
> > > I am using openmpi 1.7.4 on Ubuntu 12.04 x64 and I have a very odd
> > > problem.
> > >
> > > I have 4 nodes, all of which are defined in the hostfile and in
> > > /etc/hosts.
> > >
> > > I can log into each node using ssh and certificate method from the
> > > shell that is running the mpi job, by using their name as defined in
> > > /etc/hosts.
> > >
> > > I can run an mpi job if I include only 3 nodes in the hostfile, for
> > > example:
> > >
> > > Node1 slots=8 max-slots=8
> > > Node2 slots=8 max-slots=8
> > > Node3 slots=8 max-slots=8
> >
> > You are using an uppercase name here by intention - this is the one the
> > host returns by `hostname`? Although it is allowed and should be mangled
> > to lowercase resp. ignored for hostname resolution, I found that not all
> > programs are doing it. Best is to use only lowercase characters, in my
> > experience.
> >
> > The same version of your Ubuntu Linux is installed on all machines?
> > > > -- Reuti > > > > > > > But if I add a fourth node into the hostfile eg: > > > > > > Node1 slots=8 max-slots=8 > > > Node2 slots=8 max-slots=8 > > > Node3 slots=8 max-slots=8 > > > Node4 slots=8 max-slots=8 > > > > > > I get this error after attempting mpirun -np 32 --hostfile hostfile a.out: > > > > > > ssh: Could not resolve hostname Node4: Name or service not known. > > > > > > But, I can log into Node4 using ssh from the same shell by using ssh > > > Node4. > > > > > > Also if I mix up the hostfile like this for example and place Node1 to > > > the last spot: > > > > > > Node4 slots=8 max-slots=8 > > > Node2 slots=8 max-slots=8 > > > Node3 slots=8 max-slots=8 > > > Node1 slots=8 max-slots=8 > > > > > > The error becomes > > > > > > ssh: Could not resolve hostname Node1: Name or service not known. > > > > > > If I then go back to the three node hostfile like this: > > > > > > Node1 slots=8 max-slots=8 > > > Node4 slots=8 max-slots=8 > > > Node2 slots=8 max-slots=8 > > > > > > There is no error with three nodes even though both Node1 and Node4 > > > "cannot be found" if they are present in a 4 node hostfile in the last > > > spot. The last slot seems to be bugged. > > > > > > What is going on? How do I fix this? 
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
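Jeff's csh-style loop above, rendered in portable sh; the node names are the thread's placeholders, and this version only prints the commands so the full N-by-N matrix can be eyeballed before anything is actually run:

```shell
# Enumerate every (from, to) pair; passwordless ssh must work for all 16
# combinations before a 4-node mpirun can launch cleanly.
nodes="Node1 Node2 Node3 Node4"
pairs=0
for node in $nodes; do
  for other in $nodes; do
    echo "check: ssh $node ssh $other hostname"
    pairs=$((pairs + 1))
  done
done
echo "$pairs pairs to verify"
```

Dropping the echo (i.e., running `ssh $node ssh $other hostname` directly) performs the actual check; any pair that prompts for a password or fails to resolve is the broken link.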
Re: [OMPI users] Cannot run a job with more than 3 nodes
Yes they are. Can resolve and log into each node, from each node, using their "friendly" name, not IP.

On 12 March 2014 18:15, Jeff Squyres (jsquyres) wrote:

> Are all names resolvable from all servers?
>
> I.e., if you "ssh Node4" from Node1, Node2, and Node3, does it work?
>
> On Mar 12, 2014, at 4:07 AM, Victor wrote:
>
> > Hostname no I use lower case, but for some reason while I was writing
> > the email I thought that upper case is clearer...
> >
> > The same version of Ubuntu (12.04 x64) is on all nodes and openmpi and
> > the executable are shared via nfs.
> >
> > On 12 March 2014 16:01, Reuti wrote:
> > Hi,
> >
> > On 12.03.2014 at 07:37, Victor wrote:
> >
> > > I am using openmpi 1.7.4 on Ubuntu 12.04 x64 and I have a very odd
> > > problem.
> > >
> > > I have 4 nodes, all of which are defined in the hostfile and in
> > > /etc/hosts.
> > >
> > > I can log into each node using ssh and certificate method from the
> > > shell that is running the mpi job, by using their name as defined in
> > > /etc/hosts.
> > >
> > > I can run an mpi job if I include only 3 nodes in the hostfile, for
> > > example:
> > >
> > > Node1 slots=8 max-slots=8
> > > Node2 slots=8 max-slots=8
> > > Node3 slots=8 max-slots=8
> >
> > You are using an uppercase name here by intention - this is the one the
> > host returns by `hostname`? Although it is allowed and should be mangled
> > to lowercase resp. ignored for hostname resolution, I found that not all
> > programs are doing it. Best is to use only lowercase characters, in my
> > experience.
> >
> > The same version of your Ubuntu Linux is installed on all machines?
> >
> > -- Reuti
> >
> > > But if I add a fourth node into the hostfile eg:
> > >
> > > Node1 slots=8 max-slots=8
> > > Node2 slots=8 max-slots=8
> > > Node3 slots=8 max-slots=8
> > > Node4 slots=8 max-slots=8
> > >
> > > I get this error after attempting mpirun -np 32 --hostfile hostfile a.out:
> > >
> > > ssh: Could not resolve hostname Node4: Name or service not known.
> > > > > > But, I can log into Node4 using ssh from the same shell by using ssh > Node4. > > > > > > Also if I mix up the hostfile like this for example and place Node1 to > the last spot: > > > > > > Node4 slots=8 max-slots=8 > > > Node2 slots=8 max-slots=8 > > > Node3 slots=8 max-slots=8 > > > Node1 slots=8 max-slots=8 > > > > > > The error becomes > > > > > > ssh: Could not resolve hostname Node1: Name or service not known. > > > > > > If I then go back to the three node hostfile like this: > > > > > > Node1 slots=8 max-slots=8 > > > Node4 slots=8 max-slots=8 > > > Node2 slots=8 max-slots=8 > > > > > > There is no error with three nodes even though both Node1 and Node4 > "cannot be found" if they are present in a 4 node hostfile in the last > spot. The last slot seems to be bugged. > > > > > > What is going on? How do I fix this? > > > ___ > > > users mailing list > > > us...@open-mpi.org > > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] trying to use personal copy of 1.7.4
On 12.03.2014 at 11:39, Jeff Squyres (jsquyres) wrote:

> Generally, all you need to ensure that your personal copy of OMPI is used is
> to set the PATH and LD_LIBRARY_PATH to point to your new Open MPI
> installation. I do this all the time on my development cluster (where I have
> something like 6 billion different installations of OMPI available... mmm...
> should probably clean that up...)
>
> export LD_LIBRARY_PATH=path_to_my_ompi/lib:$LD_LIBRARY_PATH
> export PATH=path-to-my-ompi/bin:$PATH
>
> It should be noted that:
>
> 1. you need to *prefix* your PATH and LD_LIBRARY_PATH with these values
> 2. you need to set these values in a way that will be picked up on all
> servers that you use in your job. The safest way to do this is in your shell
> startup files (e.g., $HOME/.bashrc or whatever is relevant for your shell).

I see "libtorque" in the output below - were these jobs running inside a queuing system? The set paths might be different therein, and need to be set in the job script in this case.

-- Reuti

> See http://www.open-mpi.org/faq/?category=running#run-prereqs,
> http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path, and
> http://www.open-mpi.org/faq/?category=running#mpirun-prefix.
>
> Note the --prefix option that is described in the 3rd FAQ item I cited --
> that can be a bit easier, too.
>
> On Mar 12, 2014, at 2:51 AM, Ross Boylan wrote:
>
>> I took the advice here and built a personal copy of the current openmpi,
>> to see if the problems I was having with Rmpi were a result of the old
>> version on the system.
>>
>> When I do ldd on the relevant libraries (Rmpi.so is loaded dynamically
>> by R) everything looks fine; path references that should be local are.
>> But when I run the program and do lsof it shows that both the system and
>> personal versions of key libraries are opened.
>>
>> First, does anyone know which library will actually be used, or how to
>> tell which library is actually used, in this situation.
I'm running on
>> linux (Debian squeeze).
>>
>> Second, is there some way to prevent the wrong/old/system libraries from
>> being loaded?
>>
>> FWIW I'm still seeing the old misbehavior when I run this way, but, as I
>> said, I'm really not sure which libraries are being used. Since Rmpi
>> was built against the new/local ones, I think the fact that it doesn't
>> crash means I really am using the new ones.
>>
>> Here are highlights of lsof on the process running R:
>> COMMAND   PID USER  FD   TYPE DEVICE SIZE/OFF      NODE NAME
>> R       17634 ross  cwd   DIR  254,2    12288 150773764 /home/ross/KHC/sunbelt
>> R       17634 ross  rtd   DIR    8,1     4096         2 /
>> R       17634 ross  txt   REG    8,1     5648   3058294 /usr/lib/R/bin/exec/R
>> R       17634 ross  DEL   REG    8,1            2416718 /tmp/openmpi-sessions-ross@n100_0/60429/1/shared_mem_pool.n100
>> R       17634 ross  mem   REG    8,1   335240   3105336 /usr/lib/openmpi/lib/libopen-pal.so.0.0.0
>> R       17634 ross  mem   REG    8,1   304576   3105337 /usr/lib/openmpi/lib/libopen-rte.so.0.0.0
>> R       17634 ross  mem   REG    8,1   679992   3105332 /usr/lib/openmpi/lib/libmpi.so.0.0.2
>> R       17634 ross  mem   REG    8,1    93936   2967826 /usr/lib/libz.so.1.2.3.4
>> R       17634 ross  mem   REG    8,1    10648   3187256 /lib/libutil-2.11.3.so
>> R       17634 ross  mem   REG    8,1    32320   2359631 /usr/lib/libpciaccess.so.0.10.8
>> R       17634 ross  mem   REG    8,1    33368   2359338 /usr/lib/libnuma.so.1
>> R       17634 ross  mem   REG  254,2   979113 152045740 /home/ross/install/lib/libopen-pal.so.6.1.0
>> R       17634 ross  mem   REG    8,1   183456   2359592 /usr/lib/libtorque.so.2.0.0
>> R       17634 ross  mem   REG  254,2  1058125 152045781 /home/ross/install/lib/libopen-rte.so.7.0.0
>> R       17634 ross  mem   REG    8,1    49936   2359341 /usr/lib/libibverbs.so.1.0.0
>> R       17634 ross  mem   REG  254,2  2802579 152045867 /home/ross/install/lib/libmpi.so.1.3.0
>> R       17634 ross  mem   REG  254,2   106626 152046481 /home/ross/Rlib-3.0.1/Rmpi/libs/Rmpi.so
>>
>> So libmpi, libopen-pal, and libopen-rte all are opened in two versions and
>> two locations.
>>
>> Thanks.
>> Ross Boylan

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] trying to use personal copy of 1.7.4
Generally, all you need to ensure that your personal copy of OMPI is used is to set the PATH and LD_LIBRARY_PATH to point to your new Open MPI installation. I do this all the time on my development cluster (where I have something like 6 billion different installations of OMPI available... mmm... should probably clean that up...)

export LD_LIBRARY_PATH=path-to-my-ompi/lib:$LD_LIBRARY_PATH
export PATH=path-to-my-ompi/bin:$PATH

It should be noted that:

1. you need to *prefix* your PATH and LD_LIBRARY_PATH with these values
2. you need to set these values in a way that will be picked up on all servers that you use in your job. The safest way to do this is in your shell startup files (e.g., $HOME/.bashrc or whatever is relevant for your shell).

See http://www.open-mpi.org/faq/?category=running#run-prereqs, http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path, and http://www.open-mpi.org/faq/?category=running#mpirun-prefix.

Note the --prefix option that is described in the 3rd FAQ item I cited -- that can be a bit easier, too.

On Mar 12, 2014, at 2:51 AM, Ross Boylan wrote:

> I took the advice here and built a personal copy of the current openmpi,
> to see if the problems I was having with Rmpi were a result of the old
> version on the system.
>
> When I do ldd on the relevant libraries (Rmpi.so is loaded dynamically
> by R) everything looks fine; path references that should be local are.
> But when I run the program and do lsof it shows that both the system and
> personal versions of key libraries are opened.
>
> First, does anyone know which library will actually be used, or how to
> tell which library is actually used, in this situation? I'm running on
> linux (Debian squeeze).
>
> Second, is there some way to prevent the wrong/old/system libraries from
> being loaded?
>
> FWIW I'm still seeing the old misbehavior when I run this way, but, as I
> said, I'm really not sure which libraries are being used.
> Since Rmpi was built against the new/local ones, I think the fact that it
> doesn't crash means I really am using the new ones.
>
> [quoted lsof listing trimmed]
>
> So libmpi, libopen-pal, and libopen-rte all are opened in two versions and
> two locations.
>
> Thanks.
> Ross Boylan

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
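Jeff's two points above can be condensed into a small snippet. This is only a sketch: `$HOME/install` is a placeholder for wherever the personal Open MPI copy actually lives (Ross's lsof output suggests that path, but substitute your own).

```shell
# Placeholder prefix; substitute your personal Open MPI install location.
OMPI_PREFIX="$HOME/install"

# *Prepend*, so the personal copy shadows the system one in /usr/lib/openmpi:
export PATH="$OMPI_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Quick sanity check: this should print a path inside $OMPI_PREFIX
# (it prints a note instead if no mpirun is installed there yet).
command -v mpirun || echo "mpirun not found under $OMPI_PREFIX/bin"
```

Putting these lines in `$HOME/.bashrc` (or your shell's equivalent) is what makes them take effect on every node of the job, per point 2 above.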
Re: [OMPI users] ctrl+c to abort a job with openmpi-1.7.5rc2
Thanks, Jeff. Now I really understand the situation.

Tetsuya

> This all seems to be a side-effect of r30942 -- see:
>
> https://svn.open-mpi.org/trac/ompi/ticket/4365
>
> [earlier quoted messages trimmed]
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] ctrl+c to abort a job with openmpi-1.7.5rc2
This all seems to be a side-effect of r30942 -- see:

https://svn.open-mpi.org/trac/ompi/ticket/4365

On Mar 12, 2014, at 5:13 AM, wrote:

> Hi Ralph,
>
> I installed openmpi-1.7.5rc2 and applied r31019 to it.
> As far as I confirmed, the rmaps framework worked fine.
>
> However, by chance, I noticed that a single ctrl+c could not
> terminate a running job; typing it twice was necessary.
> Is this your expected behavior?
>
> [remainder of quoted message trimmed]

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Cannot run a job with more than 3 nodes
Are all names resolvable from all servers? I.e., if you "ssh Node4" from Node1, Node2, and Node3, does it work?

On Mar 12, 2014, at 4:07 AM, Victor wrote:

> Hostname: no, I use lower case, but for some reason while I was writing
> the email I thought that upper case is clearer...
>
> The same version of Ubuntu (12.04 x64) is on all nodes, and Open MPI and
> the executable are shared via NFS.
>
> [earlier quoted messages trimmed]

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI users] ctrl+c to abort a job with openmpi-1.7.5rc2
Hi Ralph,

I installed openmpi-1.7.5rc2 and applied r31019 to it. As far as I can confirm, the rmaps framework works fine.

However, by chance, I noticed that a single ctrl+c could not terminate a running job; typing it twice was necessary. Is this your expected behavior?

I hadn't used ctrl+c to abort for a while, so I don't know when this started. At least I can terminate the job with a single ctrl+c if I use openmpi-1.7.4.

And, for your information, when my ctrl+c presses are more than 5 seconds apart, I get the message below:

Abort is in progress...hit ctrl-c again within 5 seconds to forcibly terminate

Tetsuya
Re: [OMPI users] Cannot run a job with more than 3 nodes
Hostname: no, I use lower case, but for some reason while I was writing the email I thought that upper case is clearer...

The same version of Ubuntu (12.04 x64) is on all nodes, and Open MPI and the executable are shared via NFS.

On 12 March 2014 16:01, Reuti wrote:

> Hi,
>
> Am 12.03.2014 um 07:37 schrieb Victor:
>
> [quoted messages trimmed]
Re: [OMPI users] Cannot run a job with more than 3 nodes
Hi,

Am 12.03.2014 um 07:37 schrieb Victor:

> I am using openmpi 1.7.4 on Ubuntu 12.04 x64 and I have a very odd problem.
>
> [...]
>
> I can run an mpi job if I include only 3 nodes in the hostfile, for example:
>
> Node1 slots=8 max-slots=8
> Node2 slots=8 max-slots=8
> Node3 slots=8 max-slots=8

You are using an uppercase name here by intention - this is the one the host returns by `hostname`? Although it is allowed, and should be mapped to lowercase or ignored for hostname resolution, I have found that not all programs do this. Best is to use only lowercase characters, in my experience.

The same version of your Ubuntu Linux is installed on all machines?

-- Reuti

> [remainder of quoted message trimmed]
Re: [OMPI users] Cannot run a job with more than 3 nodes
I "fixed it" by finding the message regarding tree spawn in a thread from November 2013. When I run the job with -mca plm_rsh_no_tree_spawn 1, the job works over 4 nodes. I cannot identify any errors in the ssh key setup, and since I am only using 4 nodes I am not concerned about a somewhat slower launch speed. Is a faster job launch the only benefit of tree spawn?
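For anyone wanting to make the workaround above stick, Open MPI also reads MCA parameters from a per-user parameter file, so the flag doesn't have to be repeated on every mpirun. A sketch (the mpirun line is shown as a comment since it needs a live cluster):

```shell
# Persist the workaround in Open MPI's per-user MCA parameter file,
# so every mpirun picks it up without the -mca flag:
mkdir -p "$HOME/.openmpi"
echo "plm_rsh_no_tree_spawn = 1" >> "$HOME/.openmpi/mca-params.conf"

# Equivalent one-off form, as used in the message above:
#   mpirun -mca plm_rsh_no_tree_spawn 1 -np 32 --hostfile hostfile a.out
```

Note that the parameter file must exist on (or be shared with) every node, for the same reason PATH settings must.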
[OMPI users] trying to use personal copy of 1.7.4
I took the advice here and built a personal copy of the current openmpi, to see if the problems I was having with Rmpi were a result of the old version on the system.

When I do ldd on the relevant libraries (Rmpi.so is loaded dynamically by R) everything looks fine; path references that should be local are. But when I run the program and do lsof it shows that both the system and personal versions of key libraries are opened.

First, does anyone know which library will actually be used, or how to tell which library is actually used, in this situation? I'm running on linux (Debian squeeze).

Second, is there some way to prevent the wrong/old/system libraries from being loaded?

FWIW I'm still seeing the old misbehavior when I run this way, but, as I said, I'm really not sure which libraries are being used. Since Rmpi was built against the new/local ones, I think the fact that it doesn't crash means I really am using the new ones.

Here are highlights of lsof on the process running R:

COMMAND  PID    USER  FD   TYPE  DEVICE  SIZE/OFF  NODE       NAME
R        17634  ross  cwd  DIR   254,2   12288     150773764  /home/ross/KHC/sunbelt
R        17634  ross  rtd  DIR   8,1     4096      2          /
R        17634  ross  txt  REG   8,1     5648      3058294    /usr/lib/R/bin/exec/R
R        17634  ross  DEL  REG   8,1               2416718    /tmp/openmpi-sessions-ross@n100_0/60429/1/shared_mem_pool.n100
R        17634  ross  mem  REG   8,1     335240    3105336    /usr/lib/openmpi/lib/libopen-pal.so.0.0.0
R        17634  ross  mem  REG   8,1     304576    3105337    /usr/lib/openmpi/lib/libopen-rte.so.0.0.0
R        17634  ross  mem  REG   8,1     679992    3105332    /usr/lib/openmpi/lib/libmpi.so.0.0.2
R        17634  ross  mem  REG   8,1     93936     2967826    /usr/lib/libz.so.1.2.3.4
R        17634  ross  mem  REG   8,1     10648     3187256    /lib/libutil-2.11.3.so
R        17634  ross  mem  REG   8,1     32320     2359631    /usr/lib/libpciaccess.so.0.10.8
R        17634  ross  mem  REG   8,1     33368     2359338    /usr/lib/libnuma.so.1
R        17634  ross  mem  REG   254,2   979113    152045740  /home/ross/install/lib/libopen-pal.so.6.1.0
R        17634  ross  mem  REG   8,1     183456    2359592    /usr/lib/libtorque.so.2.0.0
R        17634  ross  mem  REG   254,2   1058125   152045781  /home/ross/install/lib/libopen-rte.so.7.0.0
R        17634  ross  mem  REG   8,1     49936     2359341    /usr/lib/libibverbs.so.1.0.0
R        17634  ross  mem  REG   254,2   2802579   152045867  /home/ross/install/lib/libmpi.so.1.3.0
R        17634  ross  mem  REG   254,2   106626    152046481  /home/ross/Rlib-3.0.1/Rmpi/libs/Rmpi.so

So libmpi, libopen-pal, and libopen-rte all are opened in two versions and two locations.

Thanks.
Ross Boylan
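One way to answer the "which library is actually used" question on Linux is to list the shared objects a process really has mapped, straight from /proc. A minimal sketch; substitute the PID of the R process (17634 in the lsof output above) for `$$`, which is used here only so the snippet is self-contained:

```shell
# List every shared object a running process has mapped.
# The pathname is the last whitespace-separated field of each maps line;
# sort -u collapses the multiple mappings (text, data, bss) per library.
pid=$$   # placeholder: use the PID of the R process instead
awk '$NF ~ /\.so/ { print $NF }' "/proc/$pid/maps" | sort -u
```

If both /usr/lib/openmpi/lib/libmpi.so.0.0.2 and ~/install/lib/libmpi.so.1.3.0 show up, both really are mapped; which one the symbols actually bind to is decided by the dynamic linker, and running the program with `LD_DEBUG=libs` (a glibc feature) prints its search decisions.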
[OMPI users] Cannot run a job with more than 3 nodes
I am using openmpi 1.7.4 on Ubuntu 12.04 x64 and I have a very odd problem.

I have 4 nodes, all of which are defined in the hostfile and in /etc/hosts.

I can log into each node using ssh and the certificate method from the shell that is running the mpi job, by using their names as defined in /etc/hosts.

I can run an mpi job if I include only 3 nodes in the hostfile, for example:

Node1 slots=8 max-slots=8
Node2 slots=8 max-slots=8
Node3 slots=8 max-slots=8

But if I add a fourth node into the hostfile, e.g.:

Node1 slots=8 max-slots=8
Node2 slots=8 max-slots=8
Node3 slots=8 max-slots=8
Node4 slots=8 max-slots=8

I get this error after attempting mpirun -np 32 --hostfile hostfile a.out:

ssh: Could not resolve hostname Node4: Name or service not known.

But I can log into Node4 using ssh from the same shell by using ssh Node4.

Also, if I mix up the hostfile like this, for example, and place Node1 in the last spot:

Node4 slots=8 max-slots=8
Node2 slots=8 max-slots=8
Node3 slots=8 max-slots=8
Node1 slots=8 max-slots=8

the error becomes:

ssh: Could not resolve hostname Node1: Name or service not known.

If I then go back to a three-node hostfile like this:

Node1 slots=8 max-slots=8
Node4 slots=8 max-slots=8
Node2 slots=8 max-slots=8

there is no error with three nodes, even though both Node1 and Node4 "cannot be found" when they sit in the last spot of a 4-node hostfile. The last slot seems to be bugged.

What is going on? How do I fix this?
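Since the "Could not resolve hostname" error comes from name resolution on whichever node launches the ssh (under tree spawn that is not necessarily the head node), a first check is that every hostfile entry resolves on every node. A minimal sketch; `hostfile.sample` and its single `localhost` entry are placeholders so the snippet is self-contained -- point the loop at your real hostfile and run it on each node:

```shell
# Placeholder hostfile in the same format as above.
cat > hostfile.sample <<'EOF'
localhost slots=8 max-slots=8
EOF

# Check that every node name in the hostfile resolves on this machine.
# Run the same loop on each node: under tree spawn, intermediate nodes
# also need to resolve the later entries.
while read -r node _rest; do
  [ -n "$node" ] || continue            # skip blank lines
  if getent hosts "$node" > /dev/null; then
    echo "$node: resolves"
  else
    echo "$node: DOES NOT resolve"
  fi
done < hostfile.sample
```

A name that resolves on the head node but not on an intermediate node would reproduce exactly the "works with 3 nodes, fails on the 4th" symptom, and matches why -mca plm_rsh_no_tree_spawn 1 (mentioned earlier in the thread) makes it go away.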