Re: [OMPI users] users Digest, Vol 2881, Issue 4

2014-05-06 Thread Ralph Castain

On May 6, 2014, at 6:24 PM, Clay Kirkland  wrote:

>  Got it to work finally.  The longer line doesn't work.
> 
> But if I take off the -mca oob_tcp_if_include 192.168.0.0/16 part then 
> everything works from
> every combination of machines I have.

Interesting - I'm surprised, but glad it worked.

> 
> And as to any MPI having trouble, in my original posting I stated that I 
> installed lam mpi
> on the same hardware and it worked just fine.   Maybe you guys should look at 
> what they
> do and copy it.   Virtually every machine I have used in the last 5 years has 
> multiple NIC
> interfaces, and almost all of them are set up to use only one interface.   It 
> seems odd to have
> a product that is designed to lash together multiple machines and have it 
> fail with a default
> install on generic machines.

Actually, we are the "lam mpi" guys :-)

There clearly is a bug in the connection logic, but a little hint will work 
around it until we can resolve it.

>  
>   But software is like that sometimes, and I want to thank you very much for all 
> the help.   Please 
> take my criticism with a grain of salt.   I love MPI, I just want to see it 
> work.   I have been
> using it for 20-some years to synchronize multiple machines for I/O testing 
> and it is one
> slick product for that.   It has helped us find many bugs in shared file 
> systems.  Thanks 
> again,

No problem!


Re: [OMPI users] users Digest, Vol 2881, Issue 4

2014-05-06 Thread Clay Kirkland
 Got it to work finally.  The longer line doesn't work.

But if I take off the -mca oob_tcp_if_include 192.168.0.0/16 part then
everything works from
every combination of machines I have.

And as to any MPI having trouble, in my original posting I stated that I
installed lam mpi
on the same hardware and it worked just fine.   Maybe you guys should look
at what they
do and copy it.   Virtually every machine I have used in the last 5 years
has multiple NIC
interfaces, and almost all of them are set up to use only one interface.   It
seems odd to have
a product that is designed to lash together multiple machines and have it
fail with a default
install on generic machines.

  But software is like that sometimes, and I want to thank you very much for all
the help.   Please
take my criticism with a grain of salt.   I love MPI, I just want to see it
work.   I have been
using it for 20-some years to synchronize multiple machines for I/O testing
and it is one
slick product for that.   It has helped us find many bugs in shared file
systems.  Thanks
again,





Re: [OMPI users] users Digest, Vol 2881, Issue 2

2014-05-06 Thread Ralph Castain
-mca btl_tcp_if_include 192.168.0.0/16 -mca oob_tcp_if_include 192.168.0.0/16

should do the trick. Any MPI is going to have trouble with your arrangement - 
just need a little hint to help figure it out.
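For reference, a full command line with that hint would look something like the following (the hostnames and a.out are just the ones appearing elsewhere in this thread - substitute your own):

    mpirun -np 2 --host RAID,centos \
           --mca btl_tcp_if_include 192.168.0.0/16 \
           --mca oob_tcp_if_include 192.168.0.0/16 a.out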


On May 6, 2014, at 5:14 PM, Clay Kirkland  wrote:

>  Someone suggested using some network address if all machines are on same 
> subnet.
> They are all on the same subnet, I think.   I have no idea what to put for a 
> param there.
> I tried the ethernet address but of course it couldn't be that simple.  Here 
> are my ifconfig
> outputs from a couple of machines:
> 
> [root@RAID MPI]# ifconfig -a
> eth0  Link encap:Ethernet  HWaddr 00:25:90:73:2A:36  
>   inet addr:192.168.0.59  Bcast:192.168.0.255  Mask:255.255.255.0
>   inet6 addr: fe80::225:90ff:fe73:2a36/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:17983 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:9952 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000 
>   RX bytes:26309771 (25.0 MiB)  TX bytes:758940 (741.1 KiB)
>   Interrupt:16 Memory:fbde-fbe0 
> 
> eth1  Link encap:Ethernet  HWaddr 00:25:90:73:2A:37  
>   inet6 addr: fe80::225:90ff:fe73:2a37/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:56 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000 
>   RX bytes:3924 (3.8 KiB)  TX bytes:468 (468.0 b)
>   Interrupt:17 Memory:fbee-fbf0 
> 
>  And from one that I can't get to work:
> 
> [root@centos ~]# ifconfig -a
> eth0  Link encap:Ethernet  HWaddr 00:1E:4F:FB:30:34  
>   inet6 addr: fe80::21e:4fff:fefb:3034/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:45 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000 
>   RX bytes:2700 (2.6 KiB)  TX bytes:468 (468.0 b)
>   Interrupt:21 Memory:fe9e-fea0 
> 
> eth1  Link encap:Ethernet  HWaddr 00:14:D1:22:8E:50  
>   inet addr:192.168.0.154  Bcast:192.168.0.255  Mask:255.255.255.0
>   inet6 addr: fe80::214:d1ff:fe22:8e50/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:160 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:120 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000 
>   RX bytes:31053 (30.3 KiB)  TX bytes:18897 (18.4 KiB)
>   Interrupt:16 Base address:0x2f00 
> 
>  
>  The centos machine is using eth1 and not eth0, therein lies the problem.   
> 
>  I don't really need all this optimization of using multiple Ethernet 
> adaptors to speed things
> up.   I am just using MPI to synchronize I/O tests.   Can I go back to a 
> really old version 
> and avoid all this painful debugging???
> 
>  
> 
> 

Re: [OMPI users] users Digest, Vol 2881, Issue 2

2014-05-06 Thread Clay Kirkland
 Someone suggested using some network address if all machines are on same
subnet.
They are all on the same subnet, I think.   I have no idea what to put for
a param there.
I tried the ethernet address but of course it couldn't be that simple.
Here are my ifconfig
outputs from a couple of machines:

[root@RAID MPI]# ifconfig -a
eth0  Link encap:Ethernet  HWaddr 00:25:90:73:2A:36
  inet addr:192.168.0.59  Bcast:192.168.0.255  Mask:255.255.255.0
  inet6 addr: fe80::225:90ff:fe73:2a36/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:17983 errors:0 dropped:0 overruns:0 frame:0
  TX packets:9952 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:26309771 (25.0 MiB)  TX bytes:758940 (741.1 KiB)
  Interrupt:16 Memory:fbde-fbe0

eth1  Link encap:Ethernet  HWaddr 00:25:90:73:2A:37
  inet6 addr: fe80::225:90ff:fe73:2a37/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:56 errors:0 dropped:0 overruns:0 frame:0
  TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:3924 (3.8 KiB)  TX bytes:468 (468.0 b)
  Interrupt:17 Memory:fbee-fbf0

 And from one that I can't get to work:

[root@centos ~]# ifconfig -a
eth0  Link encap:Ethernet  HWaddr 00:1E:4F:FB:30:34
  inet6 addr: fe80::21e:4fff:fefb:3034/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:45 errors:0 dropped:0 overruns:0 frame:0
  TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:2700 (2.6 KiB)  TX bytes:468 (468.0 b)
  Interrupt:21 Memory:fe9e-fea0

eth1  Link encap:Ethernet  HWaddr 00:14:D1:22:8E:50
  inet addr:192.168.0.154  Bcast:192.168.0.255  Mask:255.255.255.0
  inet6 addr: fe80::214:d1ff:fe22:8e50/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:160 errors:0 dropped:0 overruns:0 frame:0
  TX packets:120 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:31053 (30.3 KiB)  TX bytes:18897 (18.4 KiB)
  Interrupt:16 Base address:0x2f00


 The centos machine is using eth1 and not eth0, therein lies the problem.

 I don't really need all this optimization of using multiple Ethernet
adaptors to speed things
up.   I am just using MPI to synchronize I/O tests.   Can I go back to a
really old version
and avoid all this painful debugging???





Re: [OMPI users] users Digest, Vol 2881, Issue 1

2014-05-06 Thread Ralph Castain
Are these NICs on the same IP subnet, per chance? You don't have to specify 
them by name - you can say "-mca btl_tcp_if_include 10.10/16" or something.
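For example, if both machines' addresses turn out to be 192.168.0.x, something like the line below (a sketch - hostA/hostB are placeholders, and the prefix length should match your actual netmask) avoids naming eth0/eth1 at all:

    mpirun -np 2 --host hostA,hostB --mca btl_tcp_if_include 192.168.0.0/24 a.out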

On May 6, 2014, at 4:50 PM, Clay Kirkland  wrote:

>  Well it turns out  I can't seem to get all three of my machines on the same 
> page.
> Two of them are using eth0 and one is using eth1.   Centos seems unable to 
> bring 
> up multiple network interfaces for some reason and when I use the mca param 
> to 
> use eth0 it works on two machines but not the other.   Is there some way to 
> use 
> only eth1 on one host and only eth0 on the other two?   Maybe environment 
> variables
> but I can't seem to get that to work either.
> 
>  Clay
> 
> 
> On Tue, May 6, 2014 at 6:28 PM, Clay Kirkland  
> wrote:
>  That last trick seems to work.  I can get it to work once in a while with 
> those tcp options but it is
> tricky as I have three machines and two of them use eth0 as primary network 
> interface and one 
> uses eth1.   But by fiddling with network options and perhaps moving a cable 
> or two I think I can
> get it all to work.   Thanks much for the tip.
> 
>  Clay
> 
> 

Re: [OMPI users] users Digest, Vol 2881, Issue 1

2014-05-06 Thread Clay Kirkland
 Well it turns out I can't seem to get all three of my machines on the
same page.
Two of them are using eth0 and one is using eth1.   CentOS seems unable to
bring
up multiple network interfaces for some reason, and when I use the mca param
to
use eth0 it works on two machines but not the other.   Is there some way to
use
only eth1 on one host and only eth0 on the other two?   Maybe environment
variables,
but I can't seem to get that to work either.

 Clay
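
For the environment-variable route: Open MPI also picks up any MCA parameter from an environment variable named OMPI_MCA_<param>, so each host can in principle carry its own interface setting. A rough sketch, assuming bash and that the remote (non-interactive) shells actually source the file:

    # ~/.bashrc on the host whose usable NIC is eth1
    export OMPI_MCA_btl_tcp_if_include=eth1

    # ~/.bashrc on the two hosts whose usable NIC is eth0
    export OMPI_MCA_btl_tcp_if_include=eth0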


On Tue, May 6, 2014 at 6:28 PM, Clay Kirkland
wrote:

>  That last trick seems to work.  I can get it to work once in a while with
> those tcp options but it is
> tricky as I have three machines and two of them use eth0 as primary
> network interface and one
> uses eth1.   But by fiddling with network options and perhaps moving a
> cable or two I think I can
> get it all to work.   Thanks much for the tip.
>
>  Clay
>
>

Re: [OMPI users] users Digest, Vol 2881, Issue 1

2014-05-06 Thread Clay Kirkland
 That last trick seems to work.  I can get it to work once in a while with
those tcp options but it is
tricky as I have three machines and two of them use eth0 as primary network
interface and one
uses eth1.   But by fiddling with network options and perhaps moving a
cable or two I think I can
get it all to work.   Thanks much for the tip.

 Clay


On Tue, May 6, 2014 at 11:00 AM,  wrote:

> Send users mailing list submissions to
> us...@open-mpi.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
> users-requ...@open-mpi.org
>
> You can reach the person managing the list at
> users-ow...@open-mpi.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
>
> Today's Topics:
>
>1. Re: MPI_Barrier hangs on second attempt but only  when
>   multiple hosts used. (Daniels, Marcus G)
>2. ROMIO bug reading darrays (Richard Shaw)
>3. MPI File Open does not work (Imran Ali)
>4. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
>5. Re: MPI File Open does not work (Imran Ali)
>6. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
>7. Re: MPI File Open does not work (Imran Ali)
>8. Re: MPI File Open does not work (Jeff Squyres (jsquyres))
>9. Re: users Digest, Vol 2879, Issue 1 (Jeff Squyres (jsquyres))
>
>
> --
>
> Message: 1
> Date: Mon, 5 May 2014 19:28:07 +
> From: "Daniels, Marcus G" 
> To: "'us...@open-mpi.org'" 
> Subject: Re: [OMPI users] MPI_Barrier hangs on second attempt but only
> whenmultiple hosts used.
> Message-ID:
> <
> 532c594b7920a549a2a91cb4312cc57640dc5...@ecs-exg-p-mb01.win.lanl.gov>
> Content-Type: text/plain; charset="utf-8"
>
>
>
> From: Clay Kirkland [mailto:clay.kirkl...@versityinc.com]
> Sent: Friday, May 02, 2014 03:24 PM
> To: us...@open-mpi.org 
> Subject: [OMPI users] MPI_Barrier hangs on second attempt but only when
> multiple hosts used.
>
> I have been using MPI for many many years so I have very well debugged mpi
> tests.   I am
> having trouble on either openmpi-1.4.5  or  openmpi-1.6.5 versions though
> with getting the
> MPI_Barrier calls to work.   It works fine when I run all processes on one
> machine but when
> I run with two or more hosts the second call to MPI_Barrier always hangs.
>   Not the first one,
> but always the second one.   I looked at FAQ's and such but found nothing
> except for a comment
> that MPI_Barrier problems were often problems with fire walls.  Also
> mentioned as a problem
> was not having the same version of mpi on both machines.  I turned
> firewalls off and removed
> and reinstalled the same version on both hosts but I still see the same
> thing.   I then installed
> lam mpi on two of my machines and that works fine.   I can call the
> MPI_Barrier function when run on
> one of two machines by itself  many times with no hangs.  Only hangs if
> two or more hosts are involved.
> These runs are all being done on CentOS release 6.4.   Here is test
> program I used.
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <mpi.h>
>
> int main (int argc, char **argv)
> {
> char message[20];
> char hoster[256];
> char nameis[256];
> int fd, i, j, jnp, iret, myrank, np, ranker, recker;
> MPI_Comm comm;
> MPI_Status status;
>
> MPI_Init( &argc, &argv );
> MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
> MPI_Comm_size( MPI_COMM_WORLD, &np );
>
> gethostname(hoster,256);
>
> printf(" In rank %d and host= %s  Do Barrier call 1.\n",myrank,hoster);
> MPI_Barrier(MPI_COMM_WORLD);
> printf(" In rank %d and host= %s  Do Barrier call 2.\n",myrank,hoster);
> MPI_Barrier(MPI_COMM_WORLD);
> printf(" In rank %d and host= %s  Do Barrier call 3.\n",myrank,hoster);
> MPI_Barrier(MPI_COMM_WORLD);
> MPI_Finalize();
> exit(0);
> }
>
>   Here are three runs of test program.  First with two processes on one
> host, then with
> two processes on another host, and finally with one process on each of two
> hosts.  The
> first two runs are fine but the last run hangs on the second MPI_Barrier.
>
> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host centos a.out
>  In rank 0 and host= centos  Do Barrier call 1.
>  In rank 1 and host= centos  Do Barrier call 1.
>  In rank 1 and host= centos  Do Barrier call 2.
>  In rank 1 and host= centos  Do Barrier call 3.
>  In rank 0 and host= centos  Do Barrier call 2.
>  In rank 0 and host= centos  Do Barrier call 3.
> [root@centos MPI]# /usr/local/bin/mpirun -np 2 --host RAID a.out
> /root/.bashrc: line 14: unalias: ls: not found
>  In rank 0 and host= RAID  Do Barrier call 1.
>  In rank 0 and host= RAID  Do Barrier call 2.
>  In rank 0 and host= RAID  Do Barrier call 3.
>  In 

Re: [OMPI users] users Digest, Vol 2879, Issue 1

2014-05-06 Thread Jeff Squyres (jsquyres)
Are you using TCP as the MPI transport?

If so, another thing to try is to limit the IP interfaces that MPI uses for its 
traffic to see if there's some kind of problem with specific networks.

For example:

   mpirun --mca btl_tcp_if_include eth0 ...

If that works, then try adding in any/all other IP interfaces that you have on 
your machines.

A sorta-wild guess: you have some IP interfaces that aren't working, or at 
least, don't work in the way that OMPI wants them to work.  So the first 
barrier works because it flows across eth0 (or some other first network that, 
as far as OMPI is concerned, works just fine).  But then the next barrier 
round-robin advances to the next IP interface, and it doesn't work for some 
reason.

We've seen virtual machine bridge interfaces cause problems, for example.  
E.g., if a machine has a Xen virtual machine interface (vibr0, IIRC?), then 
OMPI will try to use it to communicate with peer MPI processes because it has a 
"compatible" IP address, and OMPI think it should be connected/reachable to 
peers.  If this is the case, you might want to disable such interfaces and/or 
use btl_tcp_if_include or btl_tcp_if_exclude to select the interfaces that you 
want to use.

Pro tip: if you use btl_tcp_if_exclude, remember to exclude the loopback 
interface, too.  OMPI defaults to a btl_tcp_if_include="" (blank) and 
btl_tcp_if_exclude="lo0". So if you override btl_tcp_if_exclude, you need to be 
sure to *also* include lo0 in the new value.  For example:

   mpirun --mca btl_tcp_if_exclude lo0,virb0 ...

Also, if possible, try upgrading to Open MPI 1.8.1.



On May 4, 2014, at 2:15 PM, Clay Kirkland  wrote:

>  I am configuring with all defaults.   Just doing a ./configure and then
> make and make install.   I have used Open MPI on several kinds of 
> Unix systems this way and have had no trouble before.   I believe I
> last had success on a redhat version of linux.
> 
> 

Re: [OMPI users] MPI File Open does not work

2014-05-06 Thread Jeff Squyres (jsquyres)
On May 6, 2014, at 9:40 AM, Imran Ali  wrote:

> My install was in my user directory (i.e $HOME). I managed to locate the 
> source directory and successfully run make uninstall.


FWIW, I usually install Open MPI into its own subdir.  E.g., 
$HOME/installs/openmpi-x.y.z.  Then if I don't want that install any more, I 
can just "rm -rf $HOME/installs/openmpi-x.y.z" -- no need to "make uninstall".  
Specifically: if there's nothing else installed in the same tree as Open MPI, 
you can just rm -rf its installation tree.
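A sketch of that pattern, assuming you are building from an Open MPI source tarball (the paths are just examples):

    ./configure --prefix=$HOME/installs/openmpi-1.8.1
    make -j4 all && make install
    # and later, to remove that install entirely:
    rm -rf $HOME/installs/openmpi-1.8.1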

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] MPI File Open does not work

2014-05-06 Thread Imran Ali

On 6 May 2014, at 15:34, Jeff Squyres (jsquyres) wrote:

> On May 6, 2014, at 9:32 AM, Imran Ali  wrote:
> 
>> I will attempt that then. I read at
>> 
>> http://www.open-mpi.org/faq/?category=building#install-overwrite 
>> 
>> that I should completely uninstall my previous version.
> 
> Yes, that is best.  OR: you can install into a whole separate tree and ignore 
> the first installation.
> 
>> Could you recommend to me how I can go about doing it (without root access)?
>> I am uncertain where I can use make uninstall.
> 
> If you don't have write access into the installation tree (i.e., it was 
> installed via root and you don't have root access), then your best bet is 
> simply to install into a new tree.  E.g., if OMPI is installed into 
> /opt/openmpi-1.6.2, try installing into /opt/openmpi-1.6.5, or even 
> $HOME/installs/openmpi-1.6.5, or something like that.

My install was in my user directory (i.e $HOME). I managed to locate the source 
directory and successfully run make uninstall.

Will let you know how things went after installation.

Imran

> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI File Open does not work

2014-05-06 Thread Jeff Squyres (jsquyres)
On May 6, 2014, at 9:32 AM, Imran Ali  wrote:

> I will attempt that then. I read at
> 
> http://www.open-mpi.org/faq/?category=building#install-overwrite 
> 
> that I should completely uninstall my previous version.

Yes, that is best.  OR: you can install into a whole separate tree and ignore 
the first installation.

> Could you recommend to me how I can go about doing it (without root access)?
> I am uncertain where I can use make uninstall.

If you don't have write access into the installation tree (i.e., it was 
installed via root and you don't have root access), then your best bet is 
simply to install into a new tree.  E.g., if OMPI is installed into 
/opt/openmpi-1.6.2, try installing into /opt/openmpi-1.6.5, or even 
$HOME/installs/openmpi-1.6.5, or something like that.
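Once the new tree is installed, pointing your environment at it is all that's needed to use it (bash syntax; the path is just an example):

    export PATH=$HOME/installs/openmpi-1.6.5/bin:$PATH
    export LD_LIBRARY_PATH=$HOME/installs/openmpi-1.6.5/lib:$LD_LIBRARY_PATH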

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] MPI File Open does not work

2014-05-06 Thread Imran Ali

On 6 May 2014, at 14:56, Jeff Squyres (jsquyres) wrote:

> The thread support in the 1.6 series is not very good.  You might try:
> 
> - Upgrading to 1.6.5
> - Or better yet, upgrading to 1.8.1
> 

I will attempt that then. I read at

http://www.open-mpi.org/faq/?category=building#install-overwrite 

that I should completely uninstall my previous version. Could you recommend to 
me how I can go about doing it (without root access)?
I am uncertain where I can use make uninstall.

Imran

> 
> On May 6, 2014, at 7:24 AM, Imran Ali  wrote:
> 
>> I get the following error when I try to run the following python code
>> 
>> import mpi4py.MPI as MPI
>> comm =  MPI.COMM_WORLD
>> MPI.File.Open(comm,"some.file")
>> 
>> $ mpirun -np 1 python test_mpi.py
>> Traceback (most recent call last):
>>  File "test_mpi.py", line 3, in 
>>MPI.File.Open(comm," h5ex_d_alloc.h5")
>>  File "File.pyx", line 67, in mpi4py.MPI.File.Open (src/mpi4py.MPI.c:89639)
>> mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
>> --
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --
>> 
>> My mpirun version is (Open MPI) 1.6.2. I installed openmpi using the dorsal 
>> script (https://github.com/FEniCS/dorsal) for Redhat Enterprise 6 (OS I am 
>> using, release 6.5).  It configured the build as follows:
>> 
>> ./configure --enable-mpi-thread-multiple --enable-opal-multi-threads 
>> --with-threads=posix --disable-mpi-profile
>> 
>> I need to emphasize that I do not have root access on the system I am running my 
>> application.
>> 
>> Imran
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI File Open does not work

2014-05-06 Thread Jeff Squyres (jsquyres)
The thread support in the 1.6 series is not very good.  You might try:

- Upgrading to 1.6.5
- Or better yet, upgrading to 1.8.1


On May 6, 2014, at 7:24 AM, Imran Ali  wrote:

> I get the following error when I try to run the following python code
> 
> import mpi4py.MPI as MPI
> comm =  MPI.COMM_WORLD
> MPI.File.Open(comm,"some.file")
>  
> $ mpirun -np 1 python test_mpi.py
> Traceback (most recent call last):
>   File "test_mpi.py", line 3, in 
> MPI.File.Open(comm," h5ex_d_alloc.h5")
>   File "File.pyx", line 67, in mpi4py.MPI.File.Open (src/mpi4py.MPI.c:89639)
> mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
>  
> My mpirun version is (Open MPI) 1.6.2. I installed openmpi using the dorsal 
> script (https://github.com/FEniCS/dorsal) for Redhat Enterprise 6 (OS I am 
> using, release 6.5).  It configured the build as follows:
>  
> ./configure --enable-mpi-thread-multiple --enable-opal-multi-threads 
> --with-threads=posix --disable-mpi-profile
> 
> I need to emphasize that I do not have root access on the system I am running my 
> application.
>  
> Imran
>  
>  
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI users] MPI File Open does not work

2014-05-06 Thread Imran Ali


I get the following error when I try to run the following Python code:

import mpi4py.MPI as MPI
comm = MPI.COMM_WORLD
MPI.File.Open(comm,"some.file")

$ mpirun -np 1 python test_mpi.py
Traceback (most recent call last):
  File "test_mpi.py", line 3, in <module>
    MPI.File.Open(comm," h5ex_d_alloc.h5")
  File "File.pyx", line 67, in mpi4py.MPI.File.Open (src/mpi4py.MPI.c:89639)
mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------

My mpirun version is (Open MPI) 1.6.2. I installed openmpi using the dorsal
script (https://github.com/FEniCS/dorsal) for Redhat Enterprise 6 (OS I am
using, release 6.5). It configured the build as follows:

./configure --enable-mpi-thread-multiple --enable-opal-multi-threads
--with-threads=posix --disable-mpi-profile

I need to emphasize that I do not have root access on the system I am
running my application.

Imran