Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?

2009-09-22 Thread Jonathan Dursi

Hi, Jeff:

I wish I had your problems reproducing this.  This problem apparently  
rears its head when OpenMPI is compiled with the Intel compilers as
well, but only ~1% of the time.  Unfortunately, we have users who
launch ~1400 single-node jobs at a go.  So they see on the order of a
dozen or two jobs hang per suite of simulations when using the defaults,
but their problem goes away when they use -mca btl self,tcp, or when
they use sm but set the number of fifos to np-1.


At first I had assumed it was a new-ish-architecture thing, as we
first saw the problem on the Nehalem Xeon E5540 nodes, but the sample
program hangs in exactly the same way on a Harpertown (E5430) machine
as well.  So I've been assuming that this is a real problem that, for
whatever reason, is just exposed more by this particular version of
this particular compiler.  I'd love to be wrong, and for it to be
something strange but easily changed in our environment that is
causing this.


Running with your suggested test change, e.g.

   leftneighbour = rank-1
   if (leftneighbour .eq. -1) then
!     leftneighbour = nprocs-1
      leftneighbour = MPI_PROC_NULL
   endif
   rightneighbour = rank+1
   if (rightneighbour .eq. nprocs) then
!     rightneighbour = 0
      rightneighbour = MPI_PROC_NULL
   endif

like so:
mpirun -np 6 -mca btl self,sm,tcp ./diffusion-mpi

I do seem to get different behaviour.  With OpenMPI 1.3.2, the
computation frequently runs to completion, but then the program hangs
at the very end, which hadn't happened before -- attaching gdb to a
process tells me that it's hanging in MPI_Finalize:

(gdb) where
#0  0x2b3635ecb51f in poll () from /lib64/libc.so.6
#1  0x2b3634bd87c1 in poll_dispatch () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#2  0x2b3634bd7659 in opal_event_base_loop () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#3  0x2b3634bcc189 in opal_progress () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#4  0x2b3636d7cf15 in barrier () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/openmpi/mca_grpcomm_bad.so
#5  0x2b363470158b in ompi_mpi_finalize () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#6  0x2b36344bb529 in pmpi_finalize__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#7  0x00400f99 in MAIN__ ()
#8  0x00400fda in main (argc=1, argv=0x7fff3e3908c8) at ../../../gcc-4.4.0/libgfortran/fmain.c:21

(gdb)

The rest of the time (maybe 1/4 of the time?) it hangs mid-run, in
the sendrecv:

(gdb) where
#0  0x2b2bb44b4230 in mca_pml_ob1_send () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/openmpi/mca_pml_ob1.so
#1  0x2b2baf47d296 in PMPI_Sendrecv () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#2  0x2b2baf215540 in pmpi_sendrecv__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#3  0x00400ea6 in MAIN__ ()
#4  0x00400fda in main (argc=1, argv=0x7fff62d9b9c8) at ../../../gcc-4.4.0/libgfortran/fmain.c:21



When running with OpenMPI 1.3.3, I get hangs in the program
significantly _more_ often with this change than before, typically in
the sendrecv again:


#0  0x2aeb89d6cf2b in mca_btl_sm_component_progress () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/openmpi/mca_btl_sm.so
#1  0x2aeb849bd14a in opal_progress () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#2  0x2aeb8954f235 in mca_pml_ob1_send () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/openmpi/mca_pml_ob1.so
#3  0x2aeb84516586 in PMPI_Sendrecv () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/libmpi.so.0
#4  0x2aeb842ae5b0 in pmpi_sendrecv__ () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#5  0x00400ea6 in MAIN__ ()
#6  0x00400fda in main (argc=1, argv=0x7fff12a13068) at ../../../gcc-4.4.0/libgfortran/fmain.c:21


but again occasionally in the finalize, and (unlike with 1.3.2) there
are occasional successful runs through to completion.


Again, running the program with both versions of OpenMPI without sm:
mpirun -np 6 -mca btl self,tcp ./diffusion-mpi

or with num_fifos=(np-1):
mpirun -np 6 -mca btl self,sm -mca btl_sm_num_fifos 5 ./diffusion-mpi

seems to work fine.
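
For reference, a minimal C sketch of the same nearest-neighbour exchange
pattern (not the actual diffusion-mpi source; the buffer contents, tags,
and step count here are placeholders):

/* ring.c -- hypothetical reproducer: each rank does an MPI_Sendrecv
 * exchange with its left/right neighbours, using MPI_PROC_NULL at the
 * chain ends as in the modified Fortran test above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    double a = 1.0, b = 0.0;   /* placeholder halo values */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    for (int step = 0; step < 100000; step++) {   /* arbitrary step count */
        /* send left, receive from the right */
        MPI_Sendrecv(&a, 1, MPI_DOUBLE, left,  0,
                     &b, 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* send right, receive from the left */
        MPI_Sendrecv(&a, 1, MPI_DOUBLE, right, 1,
                     &b, 1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    if (rank == 0) printf("completed\n");
    return 0;
}

Running it with the same three btl settings as above (default sm, sm with
btl_sm_num_fifos set to np-1, and tcp only) should exercise the same
single-FIFO versus per-peer-FIFO code paths.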

- Jonathan

On 2009-09-22, at 8:52PM, Jeff Squyres wrote:


Jonathan --

Sorry for the delay in replying; thanks for posting again.

I'm actually unable to replicate your problem.  :-(  I have a new
Intel 8-core X5570 box; I'm running at np=6 and np=8 on both Open MPI
1.3.2 and 1.3.3 and am not seeing the problem you're seeing.  I even
made your sample program worse -- I made a and b 100,000-element real
arrays (increasing the count args in MPI_SENDRECV to 100,000 as well),
and increased nsteps to 150,000,000.

Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-22 Thread Jeff Squyres

(only replying to users list)

Some suggestions:

- MPI seems to start up, but the additional TCP connections required
for MPI communication seem to be failing / timing out / hitting some
other error.
- Are you running firewalls between your machines?  If so, can you
disable them?
- I see that you're specifying "--mca btl_tcp_port_min_v4 36900", but
one of the debug lines reads:

[apex-backpack:31956] btl: tcp: attempting to connect() to address
10.11.14.203 on port 9360

i.e., a port outside the range you requested.
- Try not using the name "localhost", but rather the IP address of the
local machine.



On Sep 22, 2009, at 5:27 PM, Pallab Datta wrote:

The following are the ifconfig outputs for the Mac and the Linux box,
respectively:


fuji:openmpi-1.3.3 pallabdatta$ ifconfig
lo0: flags=8049 mtu 16384
inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
inet 127.0.0.1 netmask 0xff00
inet6 ::1 prefixlen 128
gif0: flags=8010 mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8863 mtu 1500
inet6 fe80::21f:5bff:fe3d:eaac%en0 prefixlen 64 scopeid 0x4
inet 10.11.14.203 netmask 0xf000 broadcast 10.11.15.255
ether 00:1f:5b:3d:ea:ac
media: autoselect (100baseTX ) status: active
supported media: autoselect 10baseT/UTP  10baseT/UTP
 10baseT/UTP  10baseT/UTP
 100baseTX  100baseTX
 100baseTX  100baseTX
 1000baseT  1000baseT
 1000baseT 
en1: flags=8863 mtu 1500
ether 00:1f:5b:3d:ea:ad
media: autoselect status: inactive
supported media: autoselect 10baseT/UTP  10baseT/UTP
 10baseT/UTP  10baseT/UTP
 100baseTX  100baseTX
 100baseTX  100baseTX
 1000baseT  1000baseT
 1000baseT 
fw0: flags=8863 mtu 4078
lladdr 00:22:41:ff:fe:ed:7d:a8
media: autoselect  status: inactive
supported media: autoselect 


LINUX:

pallabdatta@apex-backpack:~/backpack/src$ ifconfig
loLink encap:Local Loopback
 inet addr:127.0.0.1  Mask:255.0.0.0
 inet6 addr: ::1/128 Scope:Host
 UP LOOPBACK RUNNING  MTU:16436  Metric:1
 RX packets:116 errors:0 dropped:0 overruns:0 frame:0
 TX packets:116 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:0
 RX bytes:11788 (11.7 KB)  TX bytes:11788 (11.7 KB)

wlan0 Link encap:Ethernet  HWaddr 00:21:79:c2:54:c7
 inet addr:10.11.14.205  Bcast:10.11.14.255  Mask:255.255.240.0
 inet6 addr: fe80::221:79ff:fec2:54c7/64 Scope:Link
 UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
 RX packets:72531 errors:0 dropped:0 overruns:0 frame:0
 TX packets:28894 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:1000
 RX bytes:5459312 (5.4 MB)  TX bytes:7264193 (7.2 MB)

wmaster0  Link encap:UNSPEC  HWaddr
00-21-79-C2-54-C7-34-63-00-00-00-00-00-00-00-00
 UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
 RX packets:0 errors:0 dropped:0 overruns:0 frame:0
 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:1000
 RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

The Mac is a Two 2.26GHz Quad-Core Intel Xeon Mac Pro and the Linux
box is Ubuntu Server Edition 9.04.  The Mac has an ethernet interface
to connect to the network and the Linux box connects via a wireless
adapter (IOGEAR).


Please suggest any way I can fix this issue.  It really needs to work
for our project.
thanks in advance,
regards,
pallab






My other concern was the following but I am not sure it applies here.
If you have multiple interfaces on the node, and they are on the same
subnet, then you cannot actually select what IP address to go out of.
You can only select the IP address you want to connect to. In these
cases, I have seen a hang because we think we are selecting an IP
address to go out of, but it actually goes out the other one.
Perhaps you can send the User's list the output from "ifconfig" on each
of the machines which would show all the interfaces. You need to get the
right arguments for ifconfig depending on the OS you are running on.

One thought is make sure the ethernet interface is marked down on both
boxes if that is possible.

Pallab Datta wrote:

Any suggestions on how to debug this further?
Do you think I need to enable any other option besides heterogeneous
at the configure prompt?



The --enable-heterogeneous should do the trick.  And to answer the
previous question, yes, put both of the interfaces in the include
list:

--mca btl_tcp_if_include en0,wlan0

If that 

Re: [OMPI users] How to create multi-thread parallel program using thread-safe send and recv?

2009-09-22 Thread guosong

This is just a test example. The real project behind it needs to be
configured like that.

> From: te...@chem.gu.se
> To: us...@open-mpi.org
> Date: Wed, 23 Sep 2009 09:39:22 +1000
> Subject: Re: [OMPI users] How to create multi-thread parallel program using 
> thread-safe send and recv?
> 
> If you want all threads to communicate via MPI, and you're initially
> launching multiple parents, I don't really see the advantage of using
> threads at all. Why not launch 12 MPI processes?
> 
> On Tue, 2009-09-22 at 10:32 -0700, Eugene Loh wrote:
> > guosong wrote: 
> > > Thanks for responding. I used a linux cluster. I think I would like
> > > to create a model that is multithreaded and each thread can make MPI
> > > calls. I attached test code as follow. It has two pthreads and there
> > > are MPI calls in both of those two threads. In the main function,
> > > there are also MPI calls. Should I use a full multithreading?
> > I guess so. It seems like the created threads are expected to make
> > independent/concurrent message-passing calls. Do read the link I
> > sent. You need to convert from MPI_Init to MPI_Init_thread(), asking
> > for a full-multithreaded model and checking that you got it. Also
> > note in main() that the MPI_Isend() calls should be matched with
> > MPI_Wait() or similar calls. I guess the parent thread will sit in
> > such calls while the child threads do their own message passing. Good
> > luck.
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] How to create multi-thread parallel program using thread-safe send and recv?

2009-09-22 Thread Terry Frankcombe
If you want all threads to communicate via MPI, and you're initially
launching multiple parents, I don't really see the advantage of using
threads at all.  Why not launch 12 MPI processes?

On Tue, 2009-09-22 at 10:32 -0700, Eugene Loh wrote:
> guosong wrote: 
> > Thanks for responding. I used a linux cluster. I think I would like
> > to create a model that is multithreaded and each thread can make MPI
> > calls. I attached test code as follow. It has two pthreads and there
> > are MPI calls in both of those two threads. In the main function,
> > there are also MPI calls. Should I use a full multithreading?
> I guess so.  It seems like the created threads are expected to make
> independent/concurrent message-passing calls.  Do read the link I
> sent.  You need to convert from MPI_Init to MPI_Init_thread(), asking
> for a full-multithreaded model and checking that you got it.  Also
> note in main() that the MPI_Isend() calls should be matched with
> MPI_Wait() or similar calls.  I guess the parent thread will sit in
> such calls while the child threads do their own message passing.  Good
> luck.
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



[OMPI users] MPI Parent-Child process query

2009-09-22 Thread Blesson Varghese
Hi,



I am fairly new to MPI. I am just wondering if it is possible for a child
process in MPI to communicate with a process that is not its parent?
Assistance is much appreciated.



Many thanks and best regards,

Blesson.
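
A sketch of the usual approach in C, under the assumption that the other
processes come from MPI_Comm_spawn ("./child" is a placeholder binary
name): merging the parent-child intercommunicator with MPI_Intercomm_merge
yields one intracommunicator in which any process can send to any other,
parent or not.

/* spawn_merge.c -- sketch: after MPI_Comm_spawn, MPI_Intercomm_merge
 * combines parents and children into a single intracommunicator. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, inter, everyone;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* We are a parent: spawn two children. */
        MPI_Comm_spawn("./child", MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(inter, 0, &everyone);   /* parents ranked low  */
    } else {
        MPI_Intercomm_merge(parent, 1, &everyone);  /* children ranked high */
    }

    /* "everyone" is an ordinary intracommunicator: any rank can now
     * MPI_Send/MPI_Recv to any other rank, parent or child. */
    MPI_Comm_rank(everyone, &rank);
    MPI_Comm_size(everyone, &size);

    MPI_Comm_free(&everyone);
    MPI_Finalize();
    return 0;
}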



Re: [OMPI users] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-22 Thread Pallab Datta
The following are the ifconfig outputs for the Mac and the Linux box, respectively:

fuji:openmpi-1.3.3 pallabdatta$ ifconfig
lo0: flags=8049 mtu 16384
inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
inet 127.0.0.1 netmask 0xff00
inet6 ::1 prefixlen 128
gif0: flags=8010 mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8863 mtu 1500
inet6 fe80::21f:5bff:fe3d:eaac%en0 prefixlen 64 scopeid 0x4
inet 10.11.14.203 netmask 0xf000 broadcast 10.11.15.255
ether 00:1f:5b:3d:ea:ac
media: autoselect (100baseTX ) status: active
supported media: autoselect 10baseT/UTP  10baseT/UTP
 10baseT/UTP  10baseT/UTP
 100baseTX  100baseTX
 100baseTX  100baseTX
 1000baseT  1000baseT
 1000baseT 
en1: flags=8863 mtu 1500
ether 00:1f:5b:3d:ea:ad
media: autoselect status: inactive
supported media: autoselect 10baseT/UTP  10baseT/UTP
 10baseT/UTP  10baseT/UTP
 100baseTX  100baseTX
 100baseTX  100baseTX
 1000baseT  1000baseT
 1000baseT 
fw0: flags=8863 mtu 4078
lladdr 00:22:41:ff:fe:ed:7d:a8
media: autoselect  status: inactive
supported media: autoselect 


LINUX:

pallabdatta@apex-backpack:~/backpack/src$ ifconfig
loLink encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:16436  Metric:1
  RX packets:116 errors:0 dropped:0 overruns:0 frame:0
  TX packets:116 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:11788 (11.7 KB)  TX bytes:11788 (11.7 KB)

wlan0 Link encap:Ethernet  HWaddr 00:21:79:c2:54:c7
  inet addr:10.11.14.205  Bcast:10.11.14.255  Mask:255.255.240.0
  inet6 addr: fe80::221:79ff:fec2:54c7/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:72531 errors:0 dropped:0 overruns:0 frame:0
  TX packets:28894 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:5459312 (5.4 MB)  TX bytes:7264193 (7.2 MB)

wmaster0  Link encap:UNSPEC  HWaddr
00-21-79-C2-54-C7-34-63-00-00-00-00-00-00-00-00
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

The Mac is a Two 2.26GHz Quad-Core Intel Xeon Mac Pro and the Linux box is
Ubuntu Server Edition 9.04. The Mac has an ethernet interface to connect
to the network and the Linux box connects via a wireless adapter (IOGEAR).

Please suggest any way I can fix this issue. It really needs to work for
our project.
thanks in advance,
regards,
pallab





> My other concern was the following but I am not sure it applies here.
> If you have multiple interfaces on the node, and they are on the same
> subnet, then you cannot actually select what IP address to go out of.
> You can only select the IP address you want to connect to. In these
> cases, I have seen a hang because we think we are selecting an IP
> address to go out of, but it actually goes out the other one.
> Perhaps you can send the User's list the output from "ifconfig" on each
> of the machines which would show all the interfaces. You need to get the
> right arguments for ifconfig depending on the OS you are running on.
>
> One thought is make sure the ethernet interface is marked down on both
> boxes if that is possible.
>
> Pallab Datta wrote:
>> Any suggestions on how to debug this further?
>> Do you think I need to enable any other option besides heterogeneous at
>> the configure prompt?
>>
>>
>>> The --enable-heterogeneous should do the trick.  And to answer the
>>> previous question, yes, put both of the interfaces in the include list.
>>>
>>> --mca btl_tcp_if_include en0,wlan0
>>>
>>> If that does not work, then I may have one other thought why it might
>>> not work although perhaps not a solution.
>>>
>>> Rolf
>>>
>>> Pallab Datta wrote:
>>>
 Hi Rolf,

 Do i need to configure openmpi with some specific options apart from
 --enable-heterogeneous..?
 I am currently using
 ./configure --prefix=/usr/local/ --enable-heterogeneous
 --disable-static
 --enable-shared --enable-debug

 on both ends...is the above correct..?! Please let me know.
 thanks and regards,
 pallab



> Hi:
> I 

Re: [OMPI users] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-22 Thread Pallab Datta
Is this a bug running Open MPI in heterogeneous environments (between a
Mac and Linux) over wireless links?
Please suggest what needs to be done or what I am missing.
Any clues as to how to debug this will be of great help.
thanks and regards, pallab

> Hi Rolf,
>
> I ran the following:
>
> pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca
> btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca
> btl_tcp_if_include en0,wlan0 -np 2 -hetero -H localhost,10.11.14.205
> /tmp/hello
>
> [fuji.local:02267] mca: base: components_open: Looking for btl components
> [fuji.local:02267] mca: base: components_open: opening btl components
> [fuji.local:02267] mca: base: components_open: found loaded component self
> [fuji.local:02267] mca: base: components_open: component self has no
> register function
> [fuji.local:02267] mca: base: components_open: component self open
> function successful
> [fuji.local:02267] mca: base: components_open: found loaded component sm
> [fuji.local:02267] mca: base: components_open: component sm has no
> register function
> [fuji.local:02267] mca: base: components_open: component sm open function
> successful
> [fuji.local:02267] mca: base: components_open: found loaded component tcp
> [fuji.local:02267] mca: base: components_open: component tcp has no
> register function
> [fuji.local:02267] mca: base: components_open: component tcp open function
> successful
> [fuji.local:02267] select: initializing btl component self
> [fuji.local:02267] select: init of component self returned success
> [fuji.local:02267] select: initializing btl component sm
> [fuji.local:02267] select: init of component sm returned success
> [fuji.local:02267] select: initializing btl component tcp
> [fuji.local][[59424,1],0][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances]
> invalid interface "wlan0"
> [fuji.local:02267] select: init of component tcp returned success
> [apex-backpack:31956] mca: base: components_open: Looking for btl
> components
> [apex-backpack:31956] mca: base: components_open: opening btl components
> [apex-backpack:31956] mca: base: components_open: found loaded component
> self
> [apex-backpack:31956] mca: base: components_open: component self has no
> register function
> [apex-backpack:31956] mca: base: components_open: component self open
> function successful
> [apex-backpack:31956] mca: base: components_open: found loaded component
> sm
> [apex-backpack:31956] mca: base: components_open: component sm has no
> register function
> [apex-backpack:31956] mca: base: components_open: component sm open
> function successful
> [apex-backpack:31956] mca: base: components_open: found loaded component
> tcp
> [apex-backpack:31956] mca: base: components_open: component tcp has no
> register function
> [apex-backpack:31956] mca: base: components_open: component tcp open
> function successful
> [apex-backpack:31956] select: initializing btl component self
> [apex-backpack:31956] select: init of component self returned success
> [apex-backpack:31956] select: initializing btl component sm
> [apex-backpack:31956] select: init of component sm returned success
> [apex-backpack:31956] select: initializing btl component tcp
> [apex-backpack][[59424,1],1][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances]
> invalid interface "en0"
> [apex-backpack:31956] select: init of component tcp returned success
> Process 0 on fuji.local out of 2
> Process 1 on apex-backpack out of 2
> [apex-backpack:31956] btl: tcp: attempting to connect() to address
> 10.11.14.203 on port 9360
>
>
>
> It launches the processes on both ends and then it hangs at the
> send/receive part!
> What is the other thing that you were mentioning that makes you think
> it's not working?
> Please suggest..
> --regards, pallab
>
>
>
>> The --enable-heterogeneous should do the trick.  And to answer the
>> previous question, yes, put both of the interfaces in the include list.
>>
>> --mca btl_tcp_if_include en0,wlan0
>>
>> If that does not work, then I may have one other thought why it might
>> not work although perhaps not a solution.
>>
>> Rolf
>>
>> Pallab Datta wrote:
>>> Hi Rolf,
>>>
>>> Do I need to configure openmpi with some specific options apart from
>>> --enable-heterogeneous..?
>>> I am currently using
>>> ./configure --prefix=/usr/local/ --enable-heterogeneous
>>> --disable-static
>>> --enable-shared --enable-debug
>>>
>>> on both ends... Is the above correct? Please let me know.
>>> thanks and regards,
>>> pallab
>>>
>>>
 Hi:
 I assume if you wait several minutes than your program will actually
 time out, yes?  I guess I have two suggestions. First, can you run a
 non-MPI job using the wireless?  Something like hostname?  Secondly,
 you
 may want to specify the specific interfaces you want it to use on the
 two machines.  You can do that via the "--mca btl_tcp_if_include"
 run-time parameter.  Just list the ones that you expect it to use.

Re: [OMPI users] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-22 Thread Pallab Datta
Hi Rolf,

I ran the following:

pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca
btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca
btl_tcp_if_include en0,wlan0 -np 2 -hetero -H localhost,10.11.14.205
/tmp/hello

[fuji.local:02267] mca: base: components_open: Looking for btl components
[fuji.local:02267] mca: base: components_open: opening btl components
[fuji.local:02267] mca: base: components_open: found loaded component self
[fuji.local:02267] mca: base: components_open: component self has no
register function
[fuji.local:02267] mca: base: components_open: component self open
function successful
[fuji.local:02267] mca: base: components_open: found loaded component sm
[fuji.local:02267] mca: base: components_open: component sm has no
register function
[fuji.local:02267] mca: base: components_open: component sm open function
successful
[fuji.local:02267] mca: base: components_open: found loaded component tcp
[fuji.local:02267] mca: base: components_open: component tcp has no
register function
[fuji.local:02267] mca: base: components_open: component tcp open function
successful
[fuji.local:02267] select: initializing btl component self
[fuji.local:02267] select: init of component self returned success
[fuji.local:02267] select: initializing btl component sm
[fuji.local:02267] select: init of component sm returned success
[fuji.local:02267] select: initializing btl component tcp
[fuji.local][[59424,1],0][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances]
invalid interface "wlan0"
[fuji.local:02267] select: init of component tcp returned success
[apex-backpack:31956] mca: base: components_open: Looking for btl components
[apex-backpack:31956] mca: base: components_open: opening btl components
[apex-backpack:31956] mca: base: components_open: found loaded component self
[apex-backpack:31956] mca: base: components_open: component self has no
register function
[apex-backpack:31956] mca: base: components_open: component self open
function successful
[apex-backpack:31956] mca: base: components_open: found loaded component sm
[apex-backpack:31956] mca: base: components_open: component sm has no
register function
[apex-backpack:31956] mca: base: components_open: component sm open
function successful
[apex-backpack:31956] mca: base: components_open: found loaded component tcp
[apex-backpack:31956] mca: base: components_open: component tcp has no
register function
[apex-backpack:31956] mca: base: components_open: component tcp open
function successful
[apex-backpack:31956] select: initializing btl component self
[apex-backpack:31956] select: init of component self returned success
[apex-backpack:31956] select: initializing btl component sm
[apex-backpack:31956] select: init of component sm returned success
[apex-backpack:31956] select: initializing btl component tcp
[apex-backpack][[59424,1],1][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances]
invalid interface "en0"
[apex-backpack:31956] select: init of component tcp returned success
Process 0 on fuji.local out of 2
Process 1 on apex-backpack out of 2
[apex-backpack:31956] btl: tcp: attempting to connect() to address
10.11.14.203 on port 9360



It launches the processes on both ends and then it hangs at the
send/receive part!
What is the other thing that you were mentioning that makes you think
it's not working?
Please suggest..
--regards, pallab



> The --enable-heterogeneous should do the trick.  And to answer the
> previous question, yes, put both of the interfaces in the include list.
>
> --mca btl_tcp_if_include en0,wlan0
>
> If that does not work, then I may have one other thought why it might
> not work although perhaps not a solution.
>
> Rolf
>
> Pallab Datta wrote:
>> Hi Rolf,
>>
>> Do I need to configure openmpi with some specific options apart from
>> --enable-heterogeneous..?
>> I am currently using
>> ./configure --prefix=/usr/local/ --enable-heterogeneous --disable-static
>> --enable-shared --enable-debug
>>
>> on both ends... Is the above correct? Please let me know.
>> thanks and regards,
>> pallab
>>
>>
>>> Hi:
>>> I assume if you wait several minutes then your program will actually
>>> time out, yes?  I guess I have two suggestions. First, can you run a
>>> non-MPI job using the wireless?  Something like hostname?  Secondly,
>>> you
>>> may want to specify the specific interfaces you want it to use on the
>>> two machines.  You can do that via the "--mca btl_tcp_if_include"
>>> run-time parameter.  Just list the ones that you expect it to use.
>>>
>>> Also, this is not right - "--mca OMPI_mca_mpi_preconnect_all 1"  It
>>> should be --mca mpi_preconnect_mpi 1 if you want to do the connection
>>> during MPI_Init.
>>>
>>> Rolf
>>>
>>> Pallab Datta wrote:
>>>
 The following is the error dump

 fuji:src pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4
 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca
 btl
 tcp,self --mca 

Re: [OMPI users] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-22 Thread Pallab Datta
Hi Rolf,

Thanks for the suggestions. I will try it. I can run a non-MPI program
over wireless.
My Mac's ethernet interface is en0, and my Linux box's wireless is
wlan0. Can I mention both in the --mca btl_tcp_if_include option?

thanks a lot in advance,
regards, pallab


> Hi:
> I assume if you wait several minutes then your program will actually
> time out, yes?  I guess I have two suggestions. First, can you run a
> non-MPI job using the wireless?  Something like hostname?  Secondly, you
> may want to specify the specific interfaces you want it to use on the
> two machines.  You can do that via the "--mca btl_tcp_if_include"
> run-time parameter.  Just list the ones that you expect it to use.
>
> Also, this is not right - "--mca OMPI_mca_mpi_preconnect_all 1"  It
> should be --mca mpi_preconnect_mpi 1 if you want to do the connection
> during MPI_Init.
>
> Rolf
>>
>
>
> Pallab Datta wrote:
>> The following is the error dump
>>
>> fuji:src pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4
>> 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca btl
>> tcp,self --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H
>> localhost,10.11.14.205 /tmp/hello
>> [fuji.local:01316] mca: base: components_open: Looking for btl
>> components
>> [fuji.local:01316] mca: base: components_open: opening btl components
>> [fuji.local:01316] mca: base: components_open: found loaded component
>> self
>> [fuji.local:01316] mca: base: components_open: component self has no
>> register function
>> [fuji.local:01316] mca: base: components_open: component self open
>> function successful
>> [fuji.local:01316] mca: base: components_open: found loaded component
>> tcp
>> [fuji.local:01316] mca: base: components_open: component tcp has no
>> register function
>> [fuji.local:01316] mca: base: components_open: component tcp open
>> function
>> successful
>> [fuji.local:01316] select: initializing btl component self
>> [fuji.local:01316] select: init of component self returned success
>> [fuji.local:01316] select: initializing btl component tcp
>> [fuji.local:01316] select: init of component tcp returned success
>> [apex-backpack:04753] mca: base: components_open: Looking for btl
>> components
>> [apex-backpack:04753] mca: base: components_open: opening btl components
>> [apex-backpack:04753] mca: base: components_open: found loaded component
>> self
>> [apex-backpack:04753] mca: base: components_open: component self has no
>> register function
>> [apex-backpack:04753] mca: base: components_open: component self open
>> function successful
>> [apex-backpack:04753] mca: base: components_open: found loaded component
>> tcp
>> [apex-backpack:04753] mca: base: components_open: component tcp has no
>> register function
>> [apex-backpack:04753] mca: base: components_open: component tcp open
>> function successful
>> [apex-backpack:04753] select: initializing btl component self
>> [apex-backpack:04753] select: init of component self returned success
>> [apex-backpack:04753] select: initializing btl component tcp
>> [apex-backpack:04753] select: init of component tcp returned success
>> Process 0 on fuji.local out of 2
>> Process 1 on apex-backpack out of 2
>> [apex-backpack:04753] btl: tcp: attempting to connect() to address
>> 10.11.14.203 on port 9360
>>
>>
>>
>>
>>
>>> Hi
>>>
>>> I am trying to run open-mpi 1.3.3. between a linux box running ubuntu
>>> server v.9.04 and a Macintosh. I have configured openmpi with the
>>> following options.:
>>> ./configure --prefix=/usr/local/ --enable-heterogeneous
>>> --disable-shared
>>> --enable-static
>>>
>>> When both the machines are connected to the network via ethernet cables
>>> openmpi works fine.
>>>
>>> But when I switch the linux box to a wireless adapter I can reach
>>> (ping) the Macintosh,
>>> but openmpi hangs on a hello world program.
>>>
>>> I ran :
>>>
>>> /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca
>>> btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca
>>> OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205
>>> /tmp/back
>>>
>>> it hangs on a send receive function between the two ends. All my
>>> firewalls
>>> are turned off at the Macintosh end. PLEASE HELP ASAP.
>>> regards,
>>> pallab
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
> --
>
> =
> rolf.vandeva...@sun.com
> 781-442-3043
> =
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] How to create multi-thread parallel program using thread-safe send and recv?

2009-09-22 Thread guosong

Thanks for responding. I used a Linux cluster. I think I would like to create a
model that is multithreaded and each thread can make MPI calls. I attached test
code as follows. It has two pthreads and there are MPI calls in both of those
two threads. In the main function, there are also MPI calls. Should I use
full multithreading? Thanks again.





#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <pthread.h>
#include "mpi.h"
using namespace std;

pthread_mutex_t _dealmutex;
pthread_mutex_t _dealmutex1;
pthread_mutex_t _dealmutex2;

void* backID(void* arg)
{
 int myid;
 pthread_mutex_init(&_dealmutex1, NULL);
 stringstream RANK;
 MPI_Comm_rank(MPI_COMM_WORLD, &myid);
 RANK << myid;
 cout << myid << " create background ID" << endl;
 int v;
 MPI_Status status;
 MPI_Request requ1, requ2;
 int m;
 int x, y;
 int count = 0;
 string filename("f_");
 filename += RANK.str();
 filename += "_backID.txt";
 fstream fout(filename.c_str(), ios::out);
 if(!fout)
 {
  cout << "can not create the file " << filename << endl;
  fout.close();
  exit(1);
 }
 while(true)
 {
  MPI_Irecv(&m, 1, MPI_INT, MPI_ANY_SOURCE, 222, MPI_COMM_WORLD, &requ1);
  MPI_Wait(&requ1, &status);
  //fout << myid << " recv from " << status.MPI_SOURCE << " m = " << m << " with tag 222" << endl;
  //pthread_mutex_lock(&_dealmutex1);
  //cout << "BACKID_REV:" << myid << " recv from " << status.MPI_SOURCE << " m = " << m << " with tag 222" << endl;
  fout << "BACKID_REV:" << myid << " recv from " << status.MPI_SOURCE << " m = " << m << " with tag 222" << endl;
  //fflush(stdout);
  fout.flush();

  //pthread_mutex_unlock(&_dealmutex1);
  //m++;
  MPI_Send(&m, 1, MPI_INT, status.MPI_SOURCE, 333, MPI_COMM_WORLD);
  //MPI_Isend(&m, 1, MPI_INT, status.MPI_SOURCE, 333, MPI_COMM_WORLD, &requ2);
  //pthread_mutex_lock(&_dealmutex1);
  //fout << myid << " replies " << status.MPI_SOURCE << " m = " << m << endl;
  //cout << "BACKID_SEND:" << myid << " replies " << status.MPI_SOURCE << " m = " << m << endl;
  fout << "BACKID_SEND:" << myid << " replies " << status.MPI_SOURCE << " m = " << m << endl;
  //fflush(stdout);
  fout.flush();
  //pthread_mutex_unlock(&_dealmutex);
  count++;
  //pthread_mutex_unlock(&_dealmutex1);
  if(count == 50)
  {
   fout << "***backID FINISHED IN " << myid << "" << endl;
   fout.flush();
   fout.close();
   pthread_exit(NULL);
   return 0;
  }
 };
}


void* backRecv(void* arg)
{
 int myid;
 pthread_mutex_init(&_dealmutex2, NULL);
 stringstream RANK;
 MPI_Status status;
 MPI_Request requ2;
 MPI_Comm_rank(MPI_COMM_WORLD, &myid);
 RANK << myid;
 cout << myid << " create background message recv" << endl;
 int x, y;
 //char c;
 int m;
 int count = 0;
 string filename("f_");
 filename += RANK.str();
 filename += "_backRecv.txt";
 fstream fout(filename.c_str(), ios::out);
 if(!fout)
 {
  cout << "can not create the file " << filename << endl;
  fout.close();
  exit(1);
 }

 while(true)
 {
  MPI_Irecv(&m, 1, MPI_INT, MPI_ANY_SOURCE, 333, MPI_COMM_WORLD, &requ2);
  MPI_Wait(&requ2, &status);
  //pthread_mutex_lock(&_dealmutex2);
  fout << "BACKREV:" << myid << " recv from " << status.MPI_SOURCE << " m = " << m << " with tag 333" << endl;
  fout.flush();
  //cout << "BACKREV:" << myid << " recv from " << status.MPI_SOURCE << " m = " << m << " with tag 333" << endl;
  //fflush(stdout);
  //pthread_mutex_unlock(&_dealmutex);
  //pthread_mutex_lock(&_dealmutex);
  count++;
  //pthread_mutex_unlock(&_dealmutex2);
  if(count == 50)
  {
   fout << "***backRecv FINISHED IN " << myid << "" << endl;
   fout.flush();
   fout.close();
   pthread_exit(NULL);
   return 0;
  }
 };
}

int main(int argc, char **argv)
{
 int myid = 0;
 int nprocs = 0;
 pthread_t pt1 = 0;
 pthread_t pt2 = 0;
 int pret1 = 0;
 int pret2 = 0;
 int i = 0, j = 0, t = 0;
 //MPI_Status status;
 MPI_Request requ1;
 MPI_Init(&argc, &argv);
 MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
 MPI_Comm_rank(MPI_COMM_WORLD, &myid);
 pthread_mutex_init(&_dealmutex, NULL);

 for(i=0; i<50; ++i)
 {
  t = (myid + 1) * i;
  MPI_Isend(&t, 1, MPI_INT, (myid+1)%nprocs, 222, MPI_COMM_WORLD, &requ1);
  //MPI_Sendrecv(, 1, MPI_INT, (myid+1)%nprocs, 222, , 1, MPI_INT, (myid+1)%nprocs, 333, MPI_COMM_WORLD, );
  cout << "MAIN:" << myid << " sends to " << (myid+1)%nprocs << " " << myid << endl;
  fflush(stdout);
 }
 pret1 = pthread_create(&pt1, NULL, backRecv, NULL);
 if(pret1 != 0)
 {
  cout << myid << "backRecv Thread Create Failed." << endl;
  exit(1);
 }
 pret2 = pthread_create(&pt2, NULL, backID, NULL);
 if(pret2 != 0)
 {
  cout << myid << "backID Thread Create Failed." << endl;
  exit(1);
 }
 //for(i=0; i<10; ++i)
 //{
 // c += i;
 // MPI_Send(, 1, MPI_CHAR, (myid+1)%nprocs, 111, MPI_COMM_WORLD);
 // cout << myid << " send " << (char)c << " to " << (myid+1)%nprocs << endl;
 //}
 pthread_join(pt2, NULL);
 cout << "***THREAD 2 SUCESS!" << endl;
 pthread_join(pt1, NULL);
 cout << "***THREAD 1 SUCESS!" << endl;
 MPI_Finalize();
 cout << "***MAIN SUCESS!" <<

Re: [OMPI users] How to create multi-thread parallel program using thread-safe send and recv?

2009-09-22 Thread Eugene Loh




guosong wrote:

  Hi all,
  I would like to write a multi-thread parallel program. I used pthread.
  Basically, I want to create two background threads besides the main
  thread (process). For example, if I use "-np 4", the program should have
  4 main processes on four processors and two background threads for each
  main process. So there should be 8 threads totally.

Wouldn't there be 4 main threads and 8 "slave" threads for a total of
12 threads?  Anyhow, doesn't matter.

I'm not sure where you're starting, but you should at least have a
basic understanding of the different sorts of multithreaded programming
models in MPI.  One is that each process is single threaded.  Another
is the processes are multithreaded, but only the main thread makes MPI
calls.  Another is multithreaded, but only one MPI call at a time. 
Finally, there can be full multithreading.  You have to decide which of
these programming models you want and which is supported by your MPI
(or, if OMPI, how OMPI was built).

For more information, try the MPI_Init_thread() man page or
http://www.mpi-forum.org./docs/mpi21-report.pdf ... see Section 12.4 on
"MPI and Threads".
  I wrote a test program and it worked unpredictably.
  Sometimes I got the result I want, but sometimes the program got a
  segmentation fault. I used MPI_Isend and MPI_Irecv for sending and
  receiving. I do not know why. I attached the error message as follows:
 
[cheetah:29780] *** Process received signal ***
[cheetah:29780] Signal: Segmentation fault (11)
[cheetah:29780] Signal code: Address not mapped (1)
[cheetah:29780] Failing at address: 0x10
[cheetah:29779] *** Process received signal ***
[cheetah:29779] Signal: Segmentation fault (11)
[cheetah:29779] Signal code: Address not mapped (1)
[cheetah:29779] Failing at address: 0x10
[cheetah:29780] [ 0] /lib64/libpthread.so.0 [0x334b00de70]
[cheetah:29780] [ 1] /act/openmpi/gnu/lib/openmpi/mca_btl_sm.so [0x2b90e1227940]
[cheetah:29780] [ 2] /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so [0x2b90e05d61ca]
[cheetah:29780] [ 3] /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so [0x2b90e05cac86]
[cheetah:29780] [ 4] /act/openmpi/gnu/lib/libmpi.so.0(PMPI_Send+0x13d) [0x2b90dde7271d]
[cheetah:29780] [ 5] pt_muti(_Z6backIDPv+0x29b) [0x409929]
[cheetah:29780] [ 6] /lib64/libpthread.so.0 [0x334b0062f7]
[cheetah:29780] [ 7] /lib64/libc.so.6(clone+0x6d) [0x334a4d1e3d]
[cheetah:29780] *** End of error message ***
[cheetah:29779] [ 0] /lib64/libpthread.so.0 [0x334b00de70]
[cheetah:29779] [ 1] /act/openmpi/gnu/lib/openmpi/mca_btl_sm.so [0x2b39785c0940]
[cheetah:29779] [ 2] /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so [0x2b397796f1ca]
[cheetah:29779] [ 3] /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so [0x2b3977963c86]
[cheetah:29779] [ 4] /act/openmpi/gnu/lib/libmpi.so.0(PMPI_Send+0x13d) [0x2b397520b71d]
[cheetah:29779] [ 5] pt_muti(_Z6backIDPv+0x29b) [0x409929]
[cheetah:29779] [ 6] /lib64/libpthread.so.0 [0x334b0062f7]
[cheetah:29779] [ 7] /lib64/libc.so.6(clone+0x6d) [0x334a4d1e3d]
[cheetah:29779] *** End of error message ***
I used gdb to "bt" the error and I got:
 Program terminated with signal 11, Segmentation fault.
#0  0x2b90e1227940 in mca_btl_sm_alloc ()
   from /act/openmpi/gnu/lib/openmpi/mca_btl_sm.so
(gdb) bt
#0  0x2b90e1227940 in mca_btl_sm_alloc ()
   from /act/openmpi/gnu/lib/openmpi/mca_btl_sm.so
#1  0x2b90e05d61ca in mca_pml_ob1_send_request_start_copy ()
   from /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so
#2  0x2b90e05cac86 in mca_pml_ob1_send ()
   from /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so
#3  0x2b90dde7271d in PMPI_Send () from
/act/openmpi/gnu/lib/libmpi.so.0
#4  0x00409929 in backID (arg=0x0) at pt_muti.cpp:50
#5  0x00334b0062f7 in start_thread () from /lib64/libpthread.so.0
#6  0x00334a4d1e3d in clone () from /lib64/libc.so.6
So can anyone give me some suggestions or advice. Thanks very much.




Re: [OMPI users] MPI_Irecv segmentation fault

2009-09-22 Thread jody
Did you also change the "&buffer" to buffer in your MPI_Send call?

Jody

On Tue, Sep 22, 2009 at 1:38 PM, Everette Clemmer  wrote:
> Hmm, tried changing MPI_Irecv( &buffer... ) to MPI_Irecv( buffer... )
> and still no luck. Stack trace follows if that's helpful:
>
> prompt$ mpirun -np 2 ./display_test_debug
> Sending 'q' from node 0 to node 1
> [COMPUTER:50898] *** Process received signal ***
> [COMPUTER:50898] Signal: Segmentation fault (11)
> [COMPUTER:50898] Signal code:  (0)
> [COMPUTER:50898] Failing at address: 0x0
> [COMPUTER:50898] [ 0] 2   libSystem.B.dylib  0x7fff87e280aa _sigtramp + 26
> [COMPUTER:50898] [ 1] 3   ???                0x 0x0 + 0
> [COMPUTER:50898] [ 2] 4   GLUT               0x000100024a21 glutMainLoop + 261
> [COMPUTER:50898] [ 3] 5   display_test_debug 0x00011444 xsMainLoop + 67
> [COMPUTER:50898] [ 4] 6   display_test_debug 0x00011335 main + 59
> [COMPUTER:50898] [ 5] 7   display_test_debug 0x00010d9c start + 52
> [COMPUTER:50898] [ 6] 8   ???                0x0001 0x0 + 1
> [COMPUTER:50898] *** End of error message ***
> mpirun noticed that job rank 0 with PID 50897 on node COMPUTER.local
> exited on signal 15 (Terminated).
> 1 additional process aborted (not shown)
>
> Thanks,
> Everette
>
>
> On Tue, Sep 22, 2009 at 2:28 AM, Ake Sandgren  
> wrote:
>> On Mon, 2009-09-21 at 19:26 -0400, Everette Clemmer wrote:
>>> Hey all,
>>>
>>> I'm getting a segmentation fault when I attempt to receive a single
>>> character via MPI_Irecv. Code follows:
>>>
>>> void recv_func() {
>>>               if( !MASTER ) {
>>>                       char            buffer[ 1 ];
>>>                       int             flag;
>>>                       MPI_Request request;
>>>                       MPI_Status      status;
>>>
>>>                       MPI_Irecv( &buffer, 1, MPI_CHAR, 0, MPI_ANY_TAG, 
>>> MPI_COMM_WORLD, &request );
>>
>> It should be MPI_Irecv(buffer, 1, ...)
>>
>>> The segfault disappears if I comment out the MPI_Irecv call in
>>> recv_func so I'm assuming that there's something wrong with the
>>> parameters that I'm passing to it. Thoughts?
>>
>> --
>> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
>> Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
>> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
>
> --
> - Everette
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



[OMPI users] MPI_Comm_spawn query

2009-09-22 Thread Blesson Varghese
Hi,



I am fairly new to MPI. I have a few queries regarding spawning processes
that I am listing below:

a.   How can processes send data to a spawned process?

b.  Can any process (that is not a parent process) send data to a
spawned process?

c.   Can MPI_Send or MPI_Recv be used to communicate with a spawned
process?

d.  Would it be possible in MPI to specify on which processor of a cluster a
process should be spawned?



Looking forward to your reply. Would much appreciate if you could please
include code snippets for the same.



Many thanks and best regards,

Blesson. 
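
A hedged C sketch covering (a)-(d); "./worker" and "node01" are
placeholders.  (a)/(c): ordinary MPI_Send/MPI_Recv work across the
intercommunicator returned by MPI_Comm_spawn.  (b): after an
MPI_Intercomm_merge (see the spawn/merge sketch earlier in this digest),
non-parents can talk to the spawned processes too.  (d): Open MPI accepts
a "host" key in an MPI_Info object as a placement hint.

/* spawn_query.c -- sketch: parent spawns one worker, hints at its
 * placement via the "host" info key, and sends it an int over the
 * intercommunicator with a plain MPI_Send. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm inter;
    MPI_Info info;
    int data = 42;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "node01");   /* (d): placement hint */

    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);

    /* (a)/(c): rank 0 here addresses rank 0 *of the children*, because
     * inter is an intercommunicator. */
    MPI_Send(&data, 1, MPI_INT, 0, 0, inter);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}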





Re: [OMPI users] MPI_Irecv segmentation fault

2009-09-22 Thread Everette Clemmer
Hmm, tried changing MPI_Irecv( &buffer... ) to MPI_Irecv( buffer... )
and still no luck. Stack trace follows if that's helpful:

prompt$ mpirun -np 2 ./display_test_debug
Sending 'q' from node 0 to node 1
[COMPUTER:50898] *** Process received signal ***
[COMPUTER:50898] Signal: Segmentation fault (11)
[COMPUTER:50898] Signal code:  (0)
[COMPUTER:50898] Failing at address: 0x0
[COMPUTER:50898] [ 0] 2   libSystem.B.dylib  0x7fff87e280aa _sigtramp + 26
[COMPUTER:50898] [ 1] 3   ???                0x 0x0 + 0
[COMPUTER:50898] [ 2] 4   GLUT               0x000100024a21 glutMainLoop + 261
[COMPUTER:50898] [ 3] 5   display_test_debug 0x00011444 xsMainLoop + 67
[COMPUTER:50898] [ 4] 6   display_test_debug 0x00011335 main + 59
[COMPUTER:50898] [ 5] 7   display_test_debug 0x00010d9c start + 52
[COMPUTER:50898] [ 6] 8   ???                0x0001 0x0 + 1
[COMPUTER:50898] *** End of error message ***
mpirun noticed that job rank 0 with PID 50897 on node COMPUTER.local
exited on signal 15 (Terminated).
1 additional process aborted (not shown)

Thanks,
Everette


On Tue, Sep 22, 2009 at 2:28 AM, Ake Sandgren  wrote:
> On Mon, 2009-09-21 at 19:26 -0400, Everette Clemmer wrote:
>> Hey all,
>>
>> I'm getting a segmentation fault when I attempt to receive a single
>> character via MPI_Irecv. Code follows:
>>
>> void recv_func() {
>>               if( !MASTER ) {
>>                       char            buffer[ 1 ];
>>                       int             flag;
>>                       MPI_Request request;
>>                       MPI_Status      status;
>>
>>                       MPI_Irecv( &buffer, 1, MPI_CHAR, 0, MPI_ANY_TAG, 
>> MPI_COMM_WORLD, &request );
>
> It should be MPI_Irecv(buffer, 1, ...)
>
>> The segfault disappears if I comment out the MPI_Irecv call in
>> recv_func so I'm assuming that there's something wrong with the
>> parameters that I'm passing to it. Thoughts?
>
> --
> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
> Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
- Everette



Re: [OMPI users] MPI_Irecv segmentation fault

2009-09-22 Thread Ake Sandgren
On Mon, 2009-09-21 at 19:26 -0400, Everette Clemmer wrote:
> Hey all,
> 
> I'm getting a segmentation fault when I attempt to receive a single
> character via MPI_Irecv. Code follows:
> 
> void recv_func() {
>   if( !MASTER ) {
>   char        buffer[ 1 ];
>   int         flag;
>   MPI_Request request;
>   MPI_Status  status;
> 
>   MPI_Irecv( &buffer, 1, MPI_CHAR, 0, MPI_ANY_TAG, 
> MPI_COMM_WORLD, &request );

It should be MPI_Irecv(buffer, 1, ...)

> The segfault disappears if I comment out the MPI_Irecv call in
> recv_func so I'm assuming that there's something wrong with the
> parameters that I'm passing to it. Thoughts?

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
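
Combined with a matching completion call, the corrected receive side would
look roughly like this (a sketch, not the poster's program; the MASTER
test becomes a parameter, and the request is completed before the buffer
goes out of scope):

/* Corrected sketch of recv_func: pass the array itself (it decays to
 * char*), and finish the request with MPI_Wait while buffer and
 * request are still in scope. */
#include <mpi.h>

void recv_func(int is_master)
{
    if (!is_master) {
        char        buffer[1];
        MPI_Request request;
        MPI_Status  status;

        MPI_Irecv(buffer, 1, MPI_CHAR, 0, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &request);
        /* ... overlap other work here if desired ... */
        MPI_Wait(&request, &status);   /* buffer[0] is now valid */
    }
}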



[OMPI users] error in ompi-checkpoint

2009-09-22 Thread Mallikarjuna Shastry




[root@localhost examples]# mpirun -np 4 -am ft-enable-cr ./res

2   2   2   2   2   2   2   2   2   2
2   2   2   2   2   2   2   2   2   2
2   2   2   2   2   2   2   2   2   2
2   2   2   2   2   2   2   2   2   2
2   2   2   2   2   2   2   2   2   2
--
Error: The process with PID 19735 is not checkpointable.
   This could be due to one of the following:
- An application with this PID doesn't currently exist
- The application with this PID isn't checkpointable
- The application with this PID isn't an OPAL application.
   We were looking for the named files:
 /tmp/opal_cr_prog_write.19735
 /tmp/opal_cr_prog_read.19735
--
[localhost.localdomain:19733] local) Error: Unable to initiate the handshake with peer [[17893,1],1]. -1
[localhost.localdomain:19733] [[17893,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 567
[localhost.localdomain:19733] [[17893,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054
2   2   2   2   2   2   2   2   2   2
2   2   2   2   2   2   2   2   2   2
2   2   2   2   2   2   2   2   2   2
2   2   2   2   2   2   2   2   2   2







Note: pid of mpirun is 19733





[OMPI users] How to create multi-thread parallel program using thread-safe send and recv?

2009-09-22 Thread guosong

Hi all,

I would like to write a multi-thread parallel program. I used pthread. Basically,
I want to create two background threads besides the main thread (process). For
example, if I use "-np 4", the program should have 4 main processes on four
processors and two background threads for each main process. So there should be
8 threads totally. I wrote a test program and it worked unpredictably.
Sometimes I got the result I want, but sometimes the program got a segmentation
fault. I used MPI_Isend and MPI_Irecv for sending and receiving. I do not know
why. I attached the error message as follows:



[cheetah:29780] *** Process received signal ***
[cheetah:29780] Signal: Segmentation fault (11)
[cheetah:29780] Signal code: Address not mapped (1)
[cheetah:29780] Failing at address: 0x10
[cheetah:29779] *** Process received signal ***
[cheetah:29779] Signal: Segmentation fault (11)
[cheetah:29779] Signal code: Address not mapped (1)
[cheetah:29779] Failing at address: 0x10
[cheetah:29780] [ 0] /lib64/libpthread.so.0 [0x334b00de70]
[cheetah:29780] [ 1] /act/openmpi/gnu/lib/openmpi/mca_btl_sm.so [0x2b90e1227940]
[cheetah:29780] [ 2] /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so [0x2b90e05d61ca]
[cheetah:29780] [ 3] /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so [0x2b90e05cac86]
[cheetah:29780] [ 4] /act/openmpi/gnu/lib/libmpi.so.0(PMPI_Send+0x13d) [0x2b90dde7271d]
[cheetah:29780] [ 5] pt_muti(_Z6backIDPv+0x29b) [0x409929]
[cheetah:29780] [ 6] /lib64/libpthread.so.0 [0x334b0062f7]
[cheetah:29780] [ 7] /lib64/libc.so.6(clone+0x6d) [0x334a4d1e3d]
[cheetah:29780] *** End of error message ***
[cheetah:29779] [ 0] /lib64/libpthread.so.0 [0x334b00de70]
[cheetah:29779] [ 1] /act/openmpi/gnu/lib/openmpi/mca_btl_sm.so [0x2b39785c0940]
[cheetah:29779] [ 2] /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so [0x2b397796f1ca]
[cheetah:29779] [ 3] /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so [0x2b3977963c86]
[cheetah:29779] [ 4] /act/openmpi/gnu/lib/libmpi.so.0(PMPI_Send+0x13d) [0x2b397520b71d]
[cheetah:29779] [ 5] pt_muti(_Z6backIDPv+0x29b) [0x409929]
[cheetah:29779] [ 6] /lib64/libpthread.so.0 [0x334b0062f7]
[cheetah:29779] [ 7] /lib64/libc.so.6(clone+0x6d) [0x334a4d1e3d]
[cheetah:29779] *** End of error message ***




I used gdb to "bt" the error and I got:

 Program terminated with signal 11, Segmentation fault.
#0  0x2b90e1227940 in mca_btl_sm_alloc ()
   from /act/openmpi/gnu/lib/openmpi/mca_btl_sm.so
(gdb) bt
#0  0x2b90e1227940 in mca_btl_sm_alloc ()
   from /act/openmpi/gnu/lib/openmpi/mca_btl_sm.so
#1  0x2b90e05d61ca in mca_pml_ob1_send_request_start_copy ()
   from /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so
#2  0x2b90e05cac86 in mca_pml_ob1_send ()
   from /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so
#3  0x2b90dde7271d in PMPI_Send () from /act/openmpi/gnu/lib/libmpi.so.0
#4  0x00409929 in backID (arg=0x0) at pt_muti.cpp:50
#5  0x00334b0062f7 in start_thread () from /lib64/libpthread.so.0
#6  0x00334a4d1e3d in clone () from /lib64/libc.so.6
So can anyone give me some suggestions or advice? Thanks very much.
