Hi Ralph,

That gives me something more to work with...


On Aug 12, 2009, at  9:44 AM, Ralph Castain wrote:

I believe TCP works fine, Jody, as it is used on Macs fairly widely. I suspect this is something funny about your installation.

One thing I have found is that you can get this error message when you have multiple NICs installed, each with a different subnet, and the procs try to connect across different ones. Do you by chance have multiple NICs?

The head node has two active NICs:
en0: public
en1: private

The server nodes only have one connection:
en0: private
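
For reference, this is roughly how I checked which subnet each interface is on (the node name is just an example; these are plain ifconfig calls, nothing OMPI-specific):

  # on the head node
  /sbin/ifconfig en0 | grep "inet "
  /sbin/ifconfig en1 | grep "inet "
  # on a compute node
  ssh xserve02 '/sbin/ifconfig en0 | grep "inet "'

The compute nodes' en0 addresses are the 192.168.2.x ones that show up in the logs below.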


Have you tried telling OMPI which TCP interface to use? You can do so with -mca btl_tcp_if_include eth0 (or whatever you want to use).

If I try this, I get the same results (though I need to use "en0" on my machine)...
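
For concreteness, the full command looks roughly like this (the hostfile and executable names are placeholders for my actual hostfile and connectivity test; forcing the btl to tcp,self is just me making sure TCP is the transport being exercised):

  mpirun -np 2 -hostfile hosts.txt \
      -mca btl tcp,self \
      -mca btl_tcp_if_include en0 \
      ./connectivity_c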

If I include -mca btl_base_verbose 30, I get this for n=2:

++++++++++
[xserve03.local:00841] select: init of component tcp returned success
Done MPI init
checking connection between rank 0 on xserve02.local and rank 1
Done MPI init
[xserve02.local:01094] btl: tcp: attempting to connect() to address 192.168.2.103 on port 4
Done checking connection between rank 0 on xserve02.local and rank 1
Connectivity test on 2 processes PASSED.
++++++++++

If I try n=3, the job hangs and I have to kill it:

++++++++++
Done MPI init
checking connection between rank 0 on xserve02.local and rank 1
[xserve02.local:01110] btl: tcp: attempting to connect() to address 192.168.2.103 on port 4
Done MPI init
Done MPI init
checking connection between rank 1 on xserve03.local and rank 2
[xserve03.local:00860] btl: tcp: attempting to connect() to address 192.168.2.102 on port 4
Done checking connection between rank 0 on xserve02.local and rank 1
checking connection between rank 0 on xserve02.local and rank 2
Done checking connection between rank 0 on xserve02.local and rank 2
mpirun: killing job...
++++++++++

Those IP addresses are correct; I have no idea whether port 4 makes sense. Sometimes I get port 260. Should xserve03 and xserve02 be trying to use the same port for these comms?
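
In case it matters, this is how I'm looking at what the TCP btl thinks its port settings are (straight ompi_info; I haven't changed any of the defaults):

  ompi_info --param btl tcp | grep -i port

I can post that output if it would help.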


Thanks,  Jody





On Wed, Aug 12, 2009 at 10:01 AM, Jody Klymak <jkly...@uvic.ca> wrote:

On Aug 11, 2009, at  6:55 PM, Gus Correa wrote:


Did you wipe off the old directories before reinstalling?

Check.

I prefer to install on an NFS-mounted directory,

Check


Have you tried to ssh from node to node on all possible pairs?

Check - fixed this today; it works fine with the spawning user...
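
For what it's worth, the check was just a quick loop like the following (node names are placeholders; each hop should print a hostname without a password prompt):

  for src in xserve02 xserve03 xserve04; do
    for dst in xserve02 xserve03 xserve04; do
      echo "$src -> $dst"; ssh $src ssh $dst hostname
    done
  done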

How could you roll back to 1.1.5,
now that you overwrote the directories?

Oh, I still have it on another machine off the cluster in /usr/local/openmpi. It will take just 5 minutes to reinstall.

Launching jobs with Torque is much better than
using barebones mpirun.

And you don't want to stay behind with the OpenMPI versions
and improvements either.

Sure, but I'd like the jobs to be able to run at all...

Is there any sense in rolling back to 1.2.3, since that is known to work with OS X (it's the version that comes with 10.5)? My only guess at this point is that other OS X users are using non-TCP/IP communication, and the TCP stuff just doesn't work in 1.3.3.
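
If I do roll back, I'd build it into its own prefix so the two versions can coexist rather than overwrite each other (paths are just examples):

  ./configure --prefix=/opt/openmpi-1.2.3
  make all install
  # then put /opt/openmpi-1.2.3/bin first in PATH and
  # /opt/openmpi-1.2.3/lib in DYLD_LIBRARY_PATH on the nodes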

Thanks,  Jody

--
Jody Klymak
http://web.uvic.ca/~jklymak/




