Are these NICs on the same IP subnet, perchance? You don't have to specify 
them by name - you can say "-mca btl_tcp_if_include 10.10.0.0/16" or something 
like that.
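
For example (the subnet below is just a placeholder; use whatever subnet your 
hosts actually share):

    mpirun --mca btl_tcp_if_include 10.10.0.0/16 ...

Open MPI will then use whichever interface on each host (eth0, eth1, ...) has 
an address in that subnet, so you don't have to name interfaces per host.  If 
the hosts don't all share a subnet, one other thing that may work is a per-host 
line like "btl_tcp_if_include = eth0" (or eth1, as appropriate) in 
$HOME/.openmpi/mca-params.conf on each machine.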

On May 6, 2014, at 4:50 PM, Clay Kirkland <clay.kirkl...@versityinc.com> wrote:

>  Well, it turns out I can't seem to get all three of my machines on the same 
> page.  Two of them are using eth0 and one is using eth1.  CentOS seems unable 
> to bring up multiple network interfaces for some reason, and when I use the 
> mca param to use eth0 it works on two machines but not the other.  Is there 
> some way to use only eth1 on one host and only eth0 on the other two?  Maybe 
> environment variables, but I can't seem to get that to work either.
> 
>  Clay
> 
> 
> On Tue, May 6, 2014 at 6:28 PM, Clay Kirkland <clay.kirkl...@versityinc.com> 
> wrote:
>  That last trick seems to work.  I can get it to work once in a while with 
> those tcp options, but it is tricky, as I have three machines and two of them 
> use eth0 as their primary network interface and one uses eth1.  But by 
> fiddling with network options and perhaps moving a cable or two, I think I can 
> get it all to work.  Thanks much for the tip.
> 
>  Clay
> 
> 
> On Tue, May 6, 2014 at 11:00 AM, <users-requ...@open-mpi.org> wrote:
> Send users mailing list submissions to
>         us...@open-mpi.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
>         http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
>         users-requ...@open-mpi.org
> 
> You can reach the person managing the list at
>         users-ow...@open-mpi.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
> 
> 
> Today's Topics:
> 
>    1. Re: MPI_Barrier hangs on second attempt but only  when
>       multiple hosts used. (Daniels, Marcus G)
>    2. ROMIO bug reading darrays (Richard Shaw)
>    3. MPI File Open does not work (Imran Ali)
>    4. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
>    5. Re: MPI File Open does not work (Imran Ali)
>    6. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
>    7. Re: MPI File Open does not work (Imran Ali)
>    8. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
>    9. Re: users Digest, Vol 2879, Issue 1 (Jeff Squyres (jsquyres))
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Mon, 5 May 2014 19:28:07 +0000
> From: "Daniels, Marcus G" <mdani...@lanl.gov>
> To: "'us...@open-mpi.org'" <us...@open-mpi.org>
> Subject: Re: [OMPI users] MPI_Barrier hangs on second attempt but only
>         when    multiple hosts used.
> Message-ID:
>         <532c594b7920a549a2a91cb4312cc57640dc5...@ecs-exg-p-mb01.win.lanl.gov>
> Content-Type: text/plain; charset="utf-8"
> 
> 
> 
> From: Clay Kirkland [mailto:clay.kirkl...@versityinc.com]
> Sent: Friday, May 02, 2014 03:24 PM
> To: us...@open-mpi.org <us...@open-mpi.org>
> Subject: [OMPI users] MPI_Barrier hangs on second attempt but only when 
> multiple hosts used.
> 
> I have been using MPI for many years, so I have very well debugged MPI 
> tests.   I am having trouble with either openmpi-1.4.5 or openmpi-1.6.5, 
> though, getting the MPI_Barrier calls to work.   It works fine when I run 
> all processes on one machine, but when I run with two or more hosts the 
> second call to MPI_Barrier always hangs.   Not the first one, but always 
> the second one.   I looked at the FAQs and such but found nothing except 
> a comment that MPI_Barrier problems were often firewall problems.  Also 
> mentioned as a problem was not having the same version of MPI on both 
> machines.  I turned firewalls off and removed and reinstalled the same 
> version on both hosts, but I still see the same thing.   I then installed 
> LAM/MPI on two of my machines and that works fine.   I can call the 
> MPI_Barrier function many times with no hangs when running on either of 
> the two machines by itself.  It only hangs if two or more hosts are 
> involved.   These runs are all being done on CentOS release 6.4.   Here is 
> the test program I used.
> 
> #include <stdio.h>
> #include <unistd.h>
> #include <mpi.h>
> 
> int main(int argc, char **argv)
> {
>     char hoster[256];
>     int myrank, np;
> 
>     MPI_Init( &argc, &argv );
>     MPI_Comm_rank( MPI_COMM_WORLD, &myrank);
>     MPI_Comm_size( MPI_COMM_WORLD, &np);
> 
>     gethostname(hoster, 256);
> 
>     printf(" In rank %d and host= %s  Do Barrier call 1.\n", myrank, hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>     printf(" In rank %d and host= %s  Do Barrier call 2.\n", myrank, hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>     printf(" In rank %d and host= %s  Do Barrier call 3.\n", myrank, hoster);
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Finalize();
>     return 0;
> }
> 
>   Here are three runs of the test program: first with two processes on one 
> host, then with two processes on another host, and finally with one process 
> on each of two hosts.  The first two runs are fine, but the last run hangs 
> on the second MPI_Barrier.
> 
> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos a.out
>  In rank 0 and host= centos  Do Barrier call 1.
>  In rank 1 and host= centos  Do Barrier call 1.
>  In rank 1 and host= centos  Do Barrier call 2.
>  In rank 1 and host= centos  Do Barrier call 3.
>  In rank 0 and host= centos  Do Barrier call 2.
>  In rank 0 and host= centos  Do Barrier call 3.
> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host RAID a.out
> /root/.bashrc: line 14: unalias: ls: not found
>  In rank 0 and host= RAID  Do Barrier call 1.
>  In rank 0 and host= RAID  Do Barrier call 2.
>  In rank 0 and host= RAID  Do Barrier call 3.
>  In rank 1 and host= RAID  Do Barrier call 1.
>  In rank 1 and host= RAID  Do Barrier call 2.
>  In rank 1 and host= RAID  Do Barrier call 3.
> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos,RAID a.out
> /root/.bashrc: line 14: unalias: ls: not found
>  In rank 0 and host= centos  Do Barrier call 1.
>  In rank 0 and host= centos  Do Barrier call 2.
> In rank 1 and host= RAID  Do Barrier call 1.
>  In rank 1 and host= RAID  Do Barrier call 2.
> 
>   Since it is such a simple test and problem and such a widely used MPI 
> function, it must obviously
> be an installation or configuration problem.   A pstack for each of the hung 
> MPI_Barrier processes
> on the two machines shows this:
> 
> [root@centos ~]# pstack 31666
> #0  0x0000003baf0e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x00007f5de06125eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
> #2  0x00007f5de061475a in opal_event_base_loop () from 
> /usr/local/lib/libmpi.so.1
> #3  0x00007f5de0639229 in opal_progress () from /usr/local/lib/libmpi.so.1
> #4  0x00007f5de0586f75 in ompi_request_default_wait_all () from 
> /usr/local/lib/libmpi.so.1
> #5  0x00007f5ddc59565e in ompi_coll_tuned_sendrecv_actual () from 
> /usr/local/lib/openmpi/mca_coll_tuned.so
> #6  0x00007f5ddc59d8ff in ompi_coll_tuned_barrier_intra_two_procs () from 
> /usr/local/lib/openmpi/mca_coll_tuned.so
> #7  0x00007f5de05941c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
> #8  0x0000000000400a43 in main ()
> 
> [root@RAID openmpi-1.6.5]# pstack 22167
> #0  0x00000030302e8ee3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x00007f7ee46885eb in epoll_dispatch () from /usr/local/lib/libmpi.so.1
> #2  0x00007f7ee468a75a in opal_event_base_loop () from 
> /usr/local/lib/libmpi.so.1
> #3  0x00007f7ee46af229 in opal_progress () from /usr/local/lib/libmpi.so.1
> #4  0x00007f7ee45fcf75 in ompi_request_default_wait_all () from 
> /usr/local/lib/libmpi.so.1
> #5  0x00007f7ee060b65e in ompi_coll_tuned_sendrecv_actual () from 
> /usr/local/lib/openmpi/mca_coll_tuned.so
> #6  0x00007f7ee06138ff in ompi_coll_tuned_barrier_intra_two_procs () from 
> /usr/local/lib/openmpi/mca_coll_tuned.so
> #7  0x00007f7ee460a1c2 in PMPI_Barrier () from /usr/local/lib/libmpi.so.1
> #8  0x0000000000400a43 in main ()
> 
>  Which looks exactly the same on each machine.  Any thoughts or ideas would 
> be greatly appreciated as
> I am stuck.
> 
>  Clay Kirkland
> 
> 
> 
> 
> 
> 
> 
> 
> -------------- next part --------------
> HTML attachment scrubbed and removed
> 
> ------------------------------
> 
> Message: 2
> Date: Mon, 5 May 2014 22:20:59 -0400
> From: Richard Shaw <jr...@cita.utoronto.ca>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: [OMPI users] ROMIO bug reading darrays
> Message-ID:
>         <can+evmkc+9kacnpausscziufwdj3jfcsymb-8zdx1etdkab...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
> 
> Hello,
> 
> I think I've come across a bug when using ROMIO to read in a 2D distributed
> array. I've attached a test case to this email.
> 
> In the test case I first initialise an array of 25 doubles (which will be a
> 5x5 grid), then I create a datatype representing a 5x5 matrix distributed
> in 3x3 blocks over a 2x2 process grid. As a reference I use MPI_Pack to
> pull out the block-cyclic array elements local to each process (which I
> think is correct). Then I write the original array of 25 doubles to disk,
> and use MPI-IO to read it back in (performing the Open, Set_view, and
> Read_all), and compare it to the reference.
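> 
> (For reference, the read side of the test does roughly the following; this is
> a simplified sketch reconstructed from the description above, not the attached
> darr_read.c itself, and the filename is just a placeholder:)
> 
>     #include <stdio.h>
>     #include <mpi.h>
> 
>     int main(int argc, char **argv)
>     {
>         int rank, np, nlocal, i;
>         int gsizes[2]   = {5, 5};                 /* global 5x5 array  */
>         int distribs[2] = {MPI_DISTRIBUTE_CYCLIC, MPI_DISTRIBUTE_CYCLIC};
>         int dargs[2]    = {3, 3};                 /* 3x3 blocks        */
>         int psizes[2]   = {2, 2};                 /* 2x2 process grid  */
>         double local[25];
>         MPI_Datatype darray;
>         MPI_File fh;
> 
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &np);
> 
>         /* Datatype selecting this rank's block-cyclic piece of the array */
>         MPI_Type_create_darray(np, rank, 2, gsizes, distribs, dargs, psizes,
>                                MPI_ORDER_C, MPI_DOUBLE, &darray);
>         MPI_Type_commit(&darray);
>         MPI_Type_size(darray, &nlocal);
>         nlocal /= (int) sizeof(double);           /* local element count */
> 
>         /* "darr.dat" stands in for the file holding the doubles 0.0..24.0 */
>         MPI_File_open(MPI_COMM_WORLD, "darr.dat", MPI_MODE_RDONLY,
>                       MPI_INFO_NULL, &fh);
>         MPI_File_set_view(fh, 0, MPI_DOUBLE, darray, "native", MPI_INFO_NULL);
>         MPI_File_read_all(fh, local, nlocal, MPI_DOUBLE, MPI_STATUS_IGNORE);
>         MPI_File_close(&fh);
> 
>         for (i = 0; i < nlocal; i++)
>             printf("rank %d read %4.1f\n", rank, local[i]);
> 
>         MPI_Type_free(&darray);
>         MPI_Finalize();
>         return 0;
>     }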
> 
> Running this with OMPIO, the two match on all ranks.
> 
> > mpirun -mca io ompio -np 4 ./darr_read.x
> === Rank 0 === (9 elements)
> Packed:  0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
> Read:    0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
> 
> === Rank 1 === (6 elements)
> Packed: 15.0 16.0 17.0 20.0 21.0 22.0
> Read:   15.0 16.0 17.0 20.0 21.0 22.0
> 
> === Rank 2 === (6 elements)
> Packed:  3.0  4.0  8.0  9.0 13.0 14.0
> Read:    3.0  4.0  8.0  9.0 13.0 14.0
> 
> === Rank 3 === (4 elements)
> Packed: 18.0 19.0 23.0 24.0
> Read:   18.0 19.0 23.0 24.0
> 
> 
> 
> However, using ROMIO the two differ on two of the ranks:
> 
> > mpirun -mca io romio -np 4 ./darr_read.x
> === Rank 0 === (9 elements)
> Packed:  0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
> Read:    0.0  1.0  2.0  5.0  6.0  7.0 10.0 11.0 12.0
> 
> === Rank 1 === (6 elements)
> Packed: 15.0 16.0 17.0 20.0 21.0 22.0
> Read:    0.0  1.0  2.0  0.0  1.0  2.0
> 
> === Rank 2 === (6 elements)
> Packed:  3.0  4.0  8.0  9.0 13.0 14.0
> Read:    3.0  4.0  8.0  9.0 13.0 14.0
> 
> === Rank 3 === (4 elements)
> Packed: 18.0 19.0 23.0 24.0
> Read:    0.0  1.0  0.0  1.0
> 
> 
> 
> My interpretation is that the behaviour with OMPIO is correct.
> Interestingly everything matches up using both ROMIO and OMPIO if I set the
> block shape to 2x2.
> 
> This was run on OS X using 1.8.2a1r31632. I have also run this on Linux
> with Open MPI 1.7.4, and OMPIO is still correct, but using ROMIO I just get
> segfaults.
> 
> Thanks,
> Richard
> -------------- next part --------------
> HTML attachment scrubbed and removed
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: darr_read.c
> Type: text/x-csrc
> Size: 2218 bytes
> Desc: not available
> URL: 
> <http://www.open-mpi.org/MailArchives/users/attachments/20140505/5a5ab0ba/attachment.bin>
> 
> ------------------------------
> 
> Message: 3
> Date: Tue, 06 May 2014 13:24:35 +0200
> From: Imran Ali <imra...@student.matnat.uio.no>
> To: <us...@open-mpi.org>
> Subject: [OMPI users] MPI File Open does not work
> Message-ID: <d57bdf499c00360b737205b085c50...@ulrik.uio.no>
> Content-Type: text/plain; charset="utf-8"
> 
> 
> 
> I get the following error when I try to run the following python code:
> 
> import mpi4py.MPI as MPI
> comm = MPI.COMM_WORLD
> MPI.File.Open(comm,"some.file")
> 
> $ mpirun -np 1 python test_mpi.py
> Traceback (most recent call last):
>   File "test_mpi.py", line 3, in <module>
>     MPI.File.Open(comm," h5ex_d_alloc.h5")
>   File "File.pyx", line 67, in mpi4py.MPI.File.Open (src/mpi4py.MPI.c:89639)
> mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> 
> My mpirun version is (Open MPI) 1.6.2. I installed Open MPI using the
> dorsal script (https://github.com/FEniCS/dorsal) for Red Hat Enterprise
> Linux 6 (the OS I am using, release 6.5). It configured the build as follows:
> 
> ./configure --enable-mpi-thread-multiple --enable-opal-multi-threads
> --with-threads=posix --disable-mpi-profile
> 
> I should emphasize that I do not have root access on the system I am
> running my application on.
> 
> Imran
> 
> 
> 
> -------------- next part --------------
> HTML attachment scrubbed and removed
> 
> ------------------------------
> 
> Message: 4
> Date: Tue, 6 May 2014 12:56:04 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] MPI File Open does not work
> Message-ID: <e7df28cb-d4fb-4087-928e-18e61d1d2...@cisco.com>
> Content-Type: text/plain; charset="us-ascii"
> 
> The thread support in the 1.6 series is not very good.  You might try:
> 
> - Upgrading to 1.6.5
> - Or better yet, upgrading to 1.8.1
> 
> 
> On May 6, 2014, at 7:24 AM, Imran Ali <imra...@student.matnat.uio.no> wrote:
> 
> > I get the following error when I try to run the following python code
> >
> > import mpi4py.MPI as MPI
> > comm =  MPI.COMM_WORLD
> > MPI.File.Open(comm,"some.file")
> >
> > $ mpirun -np 1 python test_mpi.py
> > Traceback (most recent call last):
> >   File "test_mpi.py", line 3, in <module>
> >     MPI.File.Open(comm," h5ex_d_alloc.h5")
> >   File "File.pyx", line 67, in mpi4py.MPI.File.Open (src/mpi4py.MPI.c:89639)
> > mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
> > --------------------------------------------------------------------------
> > mpirun noticed that the job aborted, but has no info as to the process
> > that caused that situation.
> > --------------------------------------------------------------------------
> >
> > My mpirun version is (Open MPI) 1.6.2. I installed openmpi using the dorsal 
> > script (https://github.com/FEniCS/dorsal) for Redhat Enterprise 6 (OS I am 
> > using, release 6.5) . It configured the build as following :
> >
> > ./configure --enable-mpi-thread-multiple --enable-opal-multi-threads 
> > --with-threads=posix --disable-mpi-profile
> >
> > I need emphasize that I do not have root acces on the system I am running 
> > my application.
> >
> > Imran
> >
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> 
> ------------------------------
> 
> Message: 5
> Date: Tue, 6 May 2014 15:32:21 +0200
> From: Imran Ali <imra...@student.matnat.uio.no>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] MPI File Open does not work
> Message-ID: <fa6dffff-6c66-4a47-84fc-148fb51ce...@math.uio.no>
> Content-Type: text/plain; charset=us-ascii
> 
> 
> 6. mai 2014 kl. 14:56 skrev Jeff Squyres (jsquyres) <jsquy...@cisco.com>:
> 
> > The thread support in the 1.6 series is not very good.  You might try:
> >
> > - Upgrading to 1.6.5
> > - Or better yet, upgrading to 1.8.1
> >
> 
> I will attempt that then. I read at
> 
> http://www.open-mpi.org/faq/?category=building#install-overwrite
> 
> that I should completely uninstall my previous version. Could you recommend 
> how I can go about doing that without root access?
> I am uncertain where I can run make uninstall.
> 
> Imran
> 
> >
> > On May 6, 2014, at 7:24 AM, Imran Ali <imra...@student.matnat.uio.no> wrote:
> >
> >> I get the following error when I try to run the following python code
> >>
> >> import mpi4py.MPI as MPI
> >> comm =  MPI.COMM_WORLD
> >> MPI.File.Open(comm,"some.file")
> >>
> >> $ mpirun -np 1 python test_mpi.py
> >> Traceback (most recent call last):
> >>  File "test_mpi.py", line 3, in <module>
> >>    MPI.File.Open(comm," h5ex_d_alloc.h5")
> >>  File "File.pyx", line 67, in mpi4py.MPI.File.Open (src/mpi4py.MPI.c:89639)
> >> mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
> >> --------------------------------------------------------------------------
> >> mpirun noticed that the job aborted, but has no info as to the process
> >> that caused that situation.
> >> --------------------------------------------------------------------------
> >>
> >> My mpirun version is (Open MPI) 1.6.2. I installed openmpi using the 
> >> dorsal script (https://github.com/FEniCS/dorsal) for Redhat Enterprise 6 
> >> (OS I am using, release 6.5) . It configured the build as following :
> >>
> >> ./configure --enable-mpi-thread-multiple --enable-opal-multi-threads 
> >> --with-threads=posix --disable-mpi-profile
> >>
> >> I need emphasize that I do not have root acces on the system I am running 
> >> my application.
> >>
> >> Imran
> >>
> >>
> >> _______________________________________________
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to: 
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> 
> ------------------------------
> 
> Message: 6
> Date: Tue, 6 May 2014 13:34:38 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] MPI File Open does not work
> Message-ID: <2a933c0e-80f6-4ded-b44c-53b5f37ef...@cisco.com>
> Content-Type: text/plain; charset="us-ascii"
> 
> On May 6, 2014, at 9:32 AM, Imran Ali <imra...@student.matnat.uio.no> wrote:
> 
> > I will attempt that than. I read at
> >
> > http://www.open-mpi.org/faq/?category=building#install-overwrite
> >
> > that I should completely uninstall my previous version.
> 
> Yes, that is best.  OR: you can install into a whole separate tree and ignore 
> the first installation.
> 
> > Could you recommend to me how I can go about doing it (without root access).
> > I am uncertain where I can use make uninstall.
> 
> If you don't have write access into the installation tree (i.e., it was 
> installed via root and you don't have root access), then your best bet is 
> simply to install into a new tree.  E.g., if OMPI is installed into 
> /opt/openmpi-1.6.2, try installing into /opt/openmpi-1.6.5, or even 
> $HOME/installs/openmpi-1.6.5, or something like that.
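> 
> For example, something along these lines (the version number and prefix are 
> just examples; add whatever configure flags you were using before):
> 
>    ./configure --prefix=$HOME/installs/openmpi-1.6.5
>    make -j 4 all
>    make install
>    export PATH=$HOME/installs/openmpi-1.6.5/bin:$PATH
>    export LD_LIBRARY_PATH=$HOME/installs/openmpi-1.6.5/lib:$LD_LIBRARY_PATH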
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> 
> ------------------------------
> 
> Message: 7
> Date: Tue, 6 May 2014 15:40:34 +0200
> From: Imran Ali <imra...@student.matnat.uio.no>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] MPI File Open does not work
> Message-ID: <14f0596c-c5c5-4b1a-a4a8-8849d44ab...@math.uio.no>
> Content-Type: text/plain; charset=us-ascii
> 
> 
> 6. mai 2014 kl. 15:34 skrev Jeff Squyres (jsquyres) <jsquy...@cisco.com>:
> 
> > On May 6, 2014, at 9:32 AM, Imran Ali <imra...@student.matnat.uio.no> wrote:
> >
> >> I will attempt that than. I read at
> >>
> >> http://www.open-mpi.org/faq/?category=building#install-overwrite
> >>
> >> that I should completely uninstall my previous version.
> >
> > Yes, that is best.  OR: you can install into a whole separate tree and 
> > ignore the first installation.
> >
> >> Could you recommend to me how I can go about doing it (without root 
> >> access).
> >> I am uncertain where I can use make uninstall.
> >
> > If you don't have write access into the installation tree (i.e., it was 
> > installed via root and you don't have root access), then your best bet is 
> > simply to install into a new tree.  E.g., if OMPI is installed into 
> > /opt/openmpi-1.6.2, try installing into /opt/openmpi-1.6.5, or even 
> > $HOME/installs/openmpi-1.6.5, or something like that.
> 
> My install was in my user directory (i.e., $HOME). I managed to locate the 
> source directory and successfully ran make uninstall.
> 
> Will let you know how things went after installation.
> 
> Imran
> 
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to: 
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> 
> ------------------------------
> 
> Message: 8
> Date: Tue, 6 May 2014 14:42:52 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] MPI File Open does not work
> Message-ID: <710e3328-edaa-4a13-9f07-b45fe3191...@cisco.com>
> Content-Type: text/plain; charset="us-ascii"
> 
> On May 6, 2014, at 9:40 AM, Imran Ali <imra...@student.matnat.uio.no> wrote:
> 
> > My install was in my user directory (i.e $HOME). I managed to locate the 
> > source directory and successfully run make uninstall.
> 
> 
> FWIW, I usually install Open MPI into its own subdir.  E.g., 
> $HOME/installs/openmpi-x.y.z.  Then if I don't want that install any more, I 
> can just "rm -rf $HOME/installs/openmpi-x.y.z" -- no need to "make 
> uninstall".  Specifically: if there's nothing else installed in the same tree 
> as Open MPI, you can just rm -rf its installation tree.
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> 
> ------------------------------
> 
> Message: 9
> Date: Tue, 6 May 2014 14:50:34 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] users Digest, Vol 2879, Issue 1
> Message-ID: <c60aa7e1-96a7-4ccd-9b5b-11a38fb87...@cisco.com>
> Content-Type: text/plain; charset="us-ascii"
> 
> Are you using TCP as the MPI transport?
> 
> If so, another thing to try is to limit the IP interfaces that MPI uses for 
> its traffic to see if there's some kind of problem with specific networks.
> 
> For example:
> 
>    mpirun --mca btl_tcp_if_include eth0 ...
> 
> If that works, then try adding in any/all other IP interfaces that you have 
> on your machines.
> 
> A sorta-wild guess: you have some IP interfaces that aren't working, or at 
> least, don't work in the way that OMPI wants them to work.  So the first 
> barrier works because it flows across eth0 (or some other first network that, 
> as far as OMPI is concerned, works just fine).  But then the next barrier 
> round-robin advances to the next IP interface, and it doesn't work for some 
> reason.
> 
> We've seen virtual machine bridge interfaces cause problems, for example.  
> E.g., if a machine has a Xen virtual machine interface (virbr0, IIRC?), then 
> OMPI will try to use it to communicate with peer MPI processes because it has 
> a "compatible" IP address, and OMPI thinks it should be connected/reachable to 
> peers.  If this is the case, you might want to disable such interfaces and/or 
> use btl_tcp_if_include or btl_tcp_if_exclude to select the interfaces that 
> you want to use.
> 
> Pro tip: if you use btl_tcp_if_exclude, remember to exclude the loopback 
> interface, too.  OMPI defaults to btl_tcp_if_include="" (blank) and 
> btl_tcp_if_exclude="lo0".  So if you override btl_tcp_if_exclude, you need to 
> be sure to *also* include lo0 in the new value.  For example:
> 
>    mpirun --mca btl_tcp_if_exclude lo0,virbr0 ...
> 
> Also, if possible, try upgrading to Open MPI 1.8.1.
> 
> 
> 
> On May 4, 2014, at 2:15 PM, Clay Kirkland <clay.kirkl...@versityinc.com> 
> wrote:
> 
> >  I am configuring with all defaults: just doing a ./configure and then
> > make and make install.   I have used Open MPI on several kinds of
> > Unix systems this way and have had no trouble before.   I believe I
> > last had success on a Red Hat version of Linux.
> >
> >
> > On Sat, May 3, 2014 at 11:00 AM, <users-requ...@open-mpi.org> wrote:
> >
> > Today's Topics:
> >
> >    1. MPI_Barrier hangs on second attempt but only when multiple
> >       hosts used. (Clay Kirkland)
> >    2. Re: MPI_Barrier hangs on second attempt but only when
> >       multiple hosts used. (Ralph Castain)
> >
> >
> > ----------------------------------------------------------------------
> >
> > Message: 2
> > Date: Sat, 3 May 2014 06:39:20 -0700
> > From: Ralph Castain <r...@open-mpi.org>
> > To: Open MPI Users <us...@open-mpi.org>
> > Subject: Re: [OMPI users] MPI_Barrier hangs on second attempt but only
> >         when    multiple hosts used.
> > Message-ID: <3cf53d73-15d9-40bb-a2de-50ba3561a...@open-mpi.org>
> > Content-Type: text/plain; charset="us-ascii"
> >
> > Hmmm...just testing on my little cluster here on two nodes, it works just 
> > fine with 1.8.2:
> >
> > [rhc@bend001 v1.8]$ mpirun -n 2 --map-by node ./a.out
> >  In rank 0 and host= bend001  Do Barrier call 1.
> >  In rank 0 and host= bend001  Do Barrier call 2.
> >  In rank 0 and host= bend001  Do Barrier call 3.
> >  In rank 1 and host= bend002  Do Barrier call 1.
> >  In rank 1 and host= bend002  Do Barrier call 2.
> >  In rank 1 and host= bend002  Do Barrier call 3.
> > [rhc@bend001 v1.8]$
> >
> >
> > How are you configuring OMPI?
> >
> >
> >
> > -------------- next part --------------
> > HTML attachment scrubbed and removed
> >
> > ------------------------------
> >
> > Subject: Digest Footer
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > ------------------------------
> >
> > End of users Digest, Vol 2879, Issue 1
> > **************************************
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> 
> ------------------------------
> 
> Subject: Digest Footer
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ------------------------------
> 
> End of users Digest, Vol 2881, Issue 1
> **************************************
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
