Hi all,

Yes, I can ping and ssh from apex-backpack to my Mac (fuji.local). I set the wireless broadcast address to the same value on both ends (10.11.14.255), but the problem still persists.
I have tried other wireless adapters as well, but no luck so far. Please let me know what can be done.

regards,
pallab

> (putting this back on the list where others can reply as well, and if
> we solve it, the solution will be google-ized)
>
> According to your debug output:
>
>>> [apex-backpack:31956] btl: tcp: attempting to connect() to address
>>> 10.11.14.203 on port 9360
>
> It *is* trying to connect to the right IP address. Are you able to
> ping to .203 from apex-backpack?
>
> I also notice that your ethernet configuration does not exactly match
> between linux and osx:
>
> en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>     inet 10.11.14.203 netmask 0xfffff000 broadcast 10.11.15.255
>
> wlan0 Link encap:Ethernet HWaddr 00:21:79:c2:54:c7
>     inet addr:10.11.14.205 Bcast:10.11.14.255 Mask:255.255.240.0
>
> On Sep 22, 2009, at 9:26 PM, Pallab Datta wrote:
>
>> There is no firewall running between the machines. I tried using the IP
>> address instead of localhost but it gave me the same output. MPI is not
>> even timing out..it keeps eternally hanging on..:(
>>
>> I have disabled the ethernet interface on the linux box, keeping only the
>> wireless up. On the mac i only have the ethernet turned on. My mac is a 8
>> core mac pro.
>>
>> Please help me debug this..
>> thanks in advance, regards,
>> pallab
>>
>>> (only replying to users list)
>>>
>>> Some suggestions:
>>>
>>> - MPI seems to startup but the additional TCP connections required for
>>>   MPI connections seem to be failing / timing out / some other error.
>>> - Are you running firewalls between your machines? If so, can you
>>>   disable them?
>>> - I see that you're specifying "--mca btl_tcp_port_min_v4 36900" but
>>>   one of the debug lines reads:
>>>> [apex-backpack:31956] btl: tcp: attempting to connect() to address
>>>> 10.11.14.203 on port 9360
>>> - Try not using the name "localhost", but rather the IP address of the
>>>   local machine
>>>
>>> On Sep 22, 2009, at 5:27 PM, Pallab Datta wrote:
>>>
>>>> The following are the ifconfig for both the Mac and the Linux
>>>> respectively:
>>>>
>>>> fuji:openmpi-1.3.3 pallabdatta$ ifconfig
>>>> lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
>>>>     inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
>>>>     inet 127.0.0.1 netmask 0xff000000
>>>>     inet6 ::1 prefixlen 128
>>>> gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
>>>> stf0: flags=0<> mtu 1280
>>>> en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>>>>     inet6 fe80::21f:5bff:fe3d:eaac%en0 prefixlen 64 scopeid 0x4
>>>>     inet 10.11.14.203 netmask 0xfffff000 broadcast 10.11.15.255
>>>>     ether 00:1f:5b:3d:ea:ac
>>>>     media: autoselect (100baseTX <full-duplex>) status: active
>>>>     supported media: autoselect 10baseT/UTP <half-duplex> 10baseT/UTP
>>>>     <full-duplex> 10baseT/UTP <full-duplex,hw-loopback> 10baseT/UTP
>>>>     <full-duplex,flow-control> 100baseTX <half-duplex> 100baseTX
>>>>     <full-duplex> 100baseTX <full-duplex,hw-loopback> 100baseTX
>>>>     <full-duplex,flow-control> 1000baseT <full-duplex> 1000baseT
>>>>     <full-duplex,hw-loopback> 1000baseT <full-duplex,flow-control>
>>>> en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>>>>     ether 00:1f:5b:3d:ea:ad
>>>>     media: autoselect status: inactive
>>>>     supported media: autoselect 10baseT/UTP <half-duplex> 10baseT/UTP
>>>>     <full-duplex> 10baseT/UTP <full-duplex,hw-loopback> 10baseT/UTP
>>>>     <full-duplex,flow-control> 100baseTX <half-duplex> 100baseTX
>>>>     <full-duplex> 100baseTX <full-duplex,hw-loopback> 100baseTX
>>>>     <full-duplex,flow-control> 1000baseT <full-duplex> 1000baseT
>>>>     <full-duplex,hw-loopback> 1000baseT <full-duplex,flow-control>
>>>> fw0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 4078
>>>>     lladdr 00:22:41:ff:fe:ed:7d:a8
>>>>     media: autoselect <full-duplex> status: inactive
>>>>     supported media: autoselect <full-duplex>
>>>>
>>>> LINUX:
>>>> ====
>>>> pallabdatta@apex-backpack:~/backpack/src$ ifconfig
>>>> lo        Link encap:Local Loopback
>>>>           inet addr:127.0.0.1 Mask:255.0.0.0
>>>>           inet6 addr: ::1/128 Scope:Host
>>>>           UP LOOPBACK RUNNING MTU:16436 Metric:1
>>>>           RX packets:116 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:116 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:0
>>>>           RX bytes:11788 (11.7 KB) TX bytes:11788 (11.7 KB)
>>>>
>>>> wlan0     Link encap:Ethernet HWaddr 00:21:79:c2:54:c7
>>>>           inet addr:10.11.14.205 Bcast:10.11.14.255 Mask:255.255.240.0
>>>>           inet6 addr: fe80::221:79ff:fec2:54c7/64 Scope:Link
>>>>           UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>>           RX packets:72531 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:28894 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:1000
>>>>           RX bytes:5459312 (5.4 MB) TX bytes:7264193 (7.2 MB)
>>>>
>>>> wmaster0  Link encap:UNSPEC HWaddr
>>>>           00-21-79-C2-54-C7-34-63-00-00-00-00-00-00-00-00
>>>>           UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:1000
>>>>           RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
>>>>
>>>> The mac is a Two 2.26GHz Quad-Core Intel Xeon Mac Pro and the Linux Box is
>>>> Ubuntu Server Edition 9.04. The Mac has the ethernet interface to connect
>>>> to the network and the linux box connects via a wireless adapter
>>>> (IOGEAR).
>>>>
>>>> Please help me any way I can fix this issue. It really needs to work for
>>>> our project.
>>>> thanks in advance,
>>>> regards,
>>>> pallab
>>>>
>>>>> My other concern was the following but I am not sure it applies here.
>>>>> If you have multiple interfaces on the node, and they are on the same
>>>>> subnet, then you cannot actually select what IP address to go out of.
>>>>> You can only select the IP address you want to connect to. In these
>>>>> cases, I have seen a hang because we think we are selecting an IP
>>>>> address to go out of, but it actually goes out the other one.
>>>>> Perhaps you can send the User's list the output from "ifconfig" on each
>>>>> of the machines which would show all the interfaces. You need to get the
>>>>> right arguments for ifconfig depending on the OS you are running on.
>>>>>
>>>>> One thought is make sure the ethernet interface is marked down on both
>>>>> boxes if that is possible.
>>>>>
>>>>> Pallab Datta wrote:
>>>>>> Any suggestions on how to debug this further..??
>>>>>> do you think I need to enable any other option besides heterogeneous at
>>>>>> the configure prompt.?
>>>>>>
>>>>>>> The --enable-heterogeneous should do the trick. And to answer the
>>>>>>> previous question, yes, put both of the interfaces in the include list.
>>>>>>>
>>>>>>> --mca btl_tcp_if_include en0,wlan0
>>>>>>>
>>>>>>> If that does not work, then I may have one other thought why it might
>>>>>>> not work although perhaps not a solution.
>>>>>>>
>>>>>>> Rolf
>>>>>>>
>>>>>>> Pallab Datta wrote:
>>>>>>>
>>>>>>>> Hi Rolf,
>>>>>>>>
>>>>>>>> Do i need to configure openmpi with some specific options apart from
>>>>>>>> --enable-heterogeneous..?
>>>>>>>> I am currently using
>>>>>>>> ./configure --prefix=/usr/local/ --enable-heterogeneous --disable-static
>>>>>>>> --enable-shared --enable-debug
>>>>>>>>
>>>>>>>> on both ends...is the above correct..?! Please let me know.
>>>>>>>> thanks and regards,
>>>>>>>> pallab
>>>>>>>>
>>>>>>>>> Hi:
>>>>>>>>> I assume if you wait several minutes then your program will actually
>>>>>>>>> time out, yes? I guess I have two suggestions. First, can you run a
>>>>>>>>> non-MPI job using the wireless? Something like hostname? Secondly, you
>>>>>>>>> may want to specify the specific interfaces you want it to use on the
>>>>>>>>> two machines. You can do that via the "--mca btl_tcp_if_include"
>>>>>>>>> run-time parameter. Just list the ones that you expect it to use.
>>>>>>>>>
>>>>>>>>> Also, this is not right - "--mca OMPI_mca_mpi_preconnect_all 1". It
>>>>>>>>> should be --mca mpi_preconnect_mpi 1 if you want to do the connection
>>>>>>>>> during MPI_Init.
>>>>>>>>>
>>>>>>>>> Rolf
>>>>>>>>>
>>>>>>>>> Pallab Datta wrote:
>>>>>>>>>
>>>>>>>>>> The following is the error dump
>>>>>>>>>>
>>>>>>>>>> fuji:src pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4
>>>>>>>>>> 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca btl
>>>>>>>>>> tcp,self --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H
>>>>>>>>>> localhost,10.11.14.205 /tmp/hello
>>>>>>>>>> [fuji.local:01316] mca: base: components_open: Looking for btl components
>>>>>>>>>> [fuji.local:01316] mca: base: components_open: opening btl components
>>>>>>>>>> [fuji.local:01316] mca: base: components_open: found loaded component self
>>>>>>>>>> [fuji.local:01316] mca: base: components_open: component self has no register function
>>>>>>>>>> [fuji.local:01316] mca: base: components_open: component self open function successful
>>>>>>>>>> [fuji.local:01316] mca: base: components_open: found loaded component tcp
>>>>>>>>>> [fuji.local:01316] mca: base: components_open: component tcp has no register function
>>>>>>>>>> [fuji.local:01316] mca: base: components_open: component tcp open function successful
>>>>>>>>>> [fuji.local:01316] select: initializing btl component self
>>>>>>>>>> [fuji.local:01316] select: init of component self returned success
>>>>>>>>>> [fuji.local:01316] select: initializing btl component tcp
>>>>>>>>>> [fuji.local:01316] select: init of component tcp returned success
>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: Looking for btl components
>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: opening btl components
>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: found loaded component self
>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: component self has no register function
>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: component self open function successful
>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: found loaded component tcp
>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: component tcp has no register function
>>>>>>>>>> [apex-backpack:04753] mca: base: components_open: component tcp open function successful
>>>>>>>>>> [apex-backpack:04753] select: initializing btl component self
>>>>>>>>>> [apex-backpack:04753] select: init of component self returned success
>>>>>>>>>> [apex-backpack:04753] select: initializing btl component tcp
>>>>>>>>>> [apex-backpack:04753] select: init of component tcp returned success
>>>>>>>>>> Process 0 on fuji.local out of 2
>>>>>>>>>> Process 1 on apex-backpack out of 2
>>>>>>>>>> [apex-backpack:04753] btl: tcp: attempting to connect() to address
>>>>>>>>>> 10.11.14.203 on port 9360
>>>>>>>>>>
>>>>>>>>>>> Hi
>>>>>>>>>>>
>>>>>>>>>>> I am trying to run open-mpi 1.3.3. between a linux box running ubuntu
>>>>>>>>>>> server v.9.04 and a Macintosh. I have configured openmpi with the
>>>>>>>>>>> following options.:
>>>>>>>>>>> ./configure --prefix=/usr/local/ --enable-heterogeneous --disable-shared
>>>>>>>>>>> --enable-static
>>>>>>>>>>>
>>>>>>>>>>> When both the machines are connected to the network via ethernet cables
>>>>>>>>>>> openmpi works fine.
>>>>>>>>>>>
>>>>>>>>>>> But when I switch the linux box to a wireless adapter i can reach (ping)
>>>>>>>>>>> the macintosh
>>>>>>>>>>> but openmpi hangs on a hello world program.
>>>>>>>>>>>
>>>>>>>>>>> I ran :
>>>>>>>>>>>
>>>>>>>>>>> /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca
>>>>>>>>>>> btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca
>>>>>>>>>>> OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205
>>>>>>>>>>> /tmp/back
>>>>>>>>>>>
>>>>>>>>>>> it hangs on a send receive function between the two ends. All my firewalls
>>>>>>>>>>> are turned off at the macintosh end.
>>>>>>>>>>> PLEASE HELP ASAP
>>>>>>>>>>> regards,
>>>>>>>>>>> pallab
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> users mailing list
>>>>>>>>>>> us...@open-mpi.org
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> =========================
>>>>>>>>> rolf.vandeva...@sun.com
>>>>>>>>> 781-442-3043
>>>>>>>>> =========================
>>>>>
>>>>> --
>>>>> =========================
>>>>> rolf.vandeva...@sun.com
>>>>> 781-442-3043
>>>>> =========================
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>
> --
> Jeff Squyres
> jsquy...@cisco.com
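
Two numbers in this thread can be sanity-checked with plain arithmetic. The sketch below is only a cross-check, not a diagnosis of the hang. It uses Python's standard ipaddress module; the helper names (hex_mask_to_dotted, broadcast_for, swap16) are made up for illustration. First, it computes the broadcast address implied by the netmask both machines report (0xfffff000 on the Mac, 255.255.240.0 on Linux), which suggests that 10.11.15.255, not 10.11.14.255, is the value consistent with that mask. Second, it shows that 9360 is 36900 with its two bytes swapped. That would fit a debug line printing the port without converting from network byte order, but whether Open MPI 1.3.3 does so is an assumption the thread does not confirm.

```python
import ipaddress


def hex_mask_to_dotted(hex_mask: str) -> str:
    """Convert a BSD/macOS style hex netmask (e.g. 0xfffff000) to dotted form."""
    return str(ipaddress.ip_address(int(hex_mask, 16)))


def broadcast_for(addr: str, netmask: str) -> str:
    """Return the broadcast address implied by an IPv4 address and netmask."""
    iface = ipaddress.ip_interface(f"{addr}/{netmask}")
    return str(iface.network.broadcast_address)


def swap16(port: int) -> int:
    """Swap the two bytes of a 16-bit port number."""
    return int.from_bytes(port.to_bytes(2, "big"), "little")


if __name__ == "__main__":
    mac_mask = hex_mask_to_dotted("0xfffff000")
    print("Mac netmask:", mac_mask)
    print("Mac broadcast:", broadcast_for("10.11.14.203", mac_mask))
    print("Linux broadcast:", broadcast_for("10.11.14.205", "255.255.240.0"))
    print("36900 byte-swapped:", swap16(36900))
```

If the mask really is 255.255.240.0 on both ends, both hosts sit in 10.11.0.0/20 and the Mac's original broadcast setting was the matching one. Since ping and ssh already work between the machines, the broadcast value may well be unrelated to the MPI hang.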