Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?
Hi, Jeff:

I wish I had your problems reproducing this. This problem apparently rears its head when OpenMPI is compiled with the intel compilers as well, but only ~1% of the time. Unfortunately, we have users who launch ~1400 single-node jobs at a go. So they see on order a dozen or two jobs hang per suite of simulations when using the defaults, but their problem goes away when they use -mca btl self,tcp, or when they use sm but set the number of fifos to np-1.

At first I had assumed it was a new-ish-architecture thing, as we first saw the problem on the Nehalem Xeon E5540 nodes, but the sample program hangs in exactly the same way on a Harpertown (E5430) machine as well. So I've been assuming that this is a real problem that for whatever reason is just exposed more with this particular version of this particular compiler. I'd love to be wrong and for it to be something strange but easily changed in our environment that is causing this.

Running with your suggested test change, eg

  leftneighbour = rank-1
  if (leftneighbour .eq. -1) then
     ! leftneighbour = nprocs-1
     leftneighbour = MPI_PROC_NULL
  endif
  rightneighbour = rank+1
  if (rightneighbour .eq. nprocs) then
     ! rightneighbour = 0
     rightneighbour = MPI_PROC_NULL
  endif

like so:

  mpirun -np 6 -mca btl self,sm,tcp ./diffusion-mpi

I do seem to get different behaviour.
With OpenMPI 1.3.2, the program frequently runs to completion, but when it does so it hangs at the end, which hadn't happened before -- attaching gdb to a process tells me that it's hanging in mpi_finalize:

(gdb) where
#0  0x2b3635ecb51f in poll () from /lib64/libc.so.6
#1  0x2b3634bd87c1 in poll_dispatch () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#2  0x2b3634bd7659 in opal_event_base_loop () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#3  0x2b3634bcc189 in opal_progress () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#4  0x2b3636d7cf15 in barrier () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/openmpi/mca_grpcomm_bad.so
#5  0x2b363470158b in ompi_mpi_finalize () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#6  0x2b36344bb529 in pmpi_finalize__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#7  0x00400f99 in MAIN__ ()
#8  0x00400fda in main (argc=1, argv=0x7fff3e3908c8) at ../../../gcc-4.4.0/libgfortran/fmain.c:21
(gdb)

The rest of the time (maybe 1/4 of the time?)
it hangs mid-run, in the sendrecv:

(gdb) where
#0  0x2b2bb44b4230 in mca_pml_ob1_send () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/openmpi/mca_pml_ob1.so
#1  0x2b2baf47d296 in PMPI_Sendrecv () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi.so.0
#2  0x2b2baf215540 in pmpi_sendrecv__ () from /scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#3  0x00400ea6 in MAIN__ ()
#4  0x00400fda in main (argc=1, argv=0x7fff62d9b9c8) at ../../../gcc-4.4.0/libgfortran/fmain.c:21

When running with OpenMPI 1.3.3, I get hangs in the program significantly _more_ often with this change than before, typically in the sendrecv again:

#0  0x2aeb89d6cf2b in mca_btl_sm_component_progress () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/openmpi/mca_btl_sm.so
#1  0x2aeb849bd14a in opal_progress () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/libopen-pal.so.0
#2  0x2aeb8954f235 in mca_pml_ob1_send () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/openmpi/mca_pml_ob1.so
#3  0x2aeb84516586 in PMPI_Sendrecv () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/libmpi.so.0
#4  0x2aeb842ae5b0 in pmpi_sendrecv__ () from /scinet/gpc/mpi/openmpi/1.3.3-gcc-v4.4.0-ofed/lib/libmpi_f77.so.0
#5  0x00400ea6 in MAIN__ ()
#6  0x00400fda in main (argc=1, argv=0x7fff12a13068) at ../../../gcc-4.4.0/libgfortran/fmain.c:21

but again occasionally in the finalize, and (unlike with 1.3.2) occasional successful runs through to completion.

Again, running the program with both versions of openmpi without sm

  mpirun -np 6 -mca btl self,tcp ./diffusion-mpi

or with num_fifos=(np-1):

  mpirun -np 6 -mca btl self,sm -mca btl_sm_num_fifos 5 ./diffusion-mpi

seems to work fine.

- Jonathan

On 2009-09-22, at 8:52PM, Jeff Squyres wrote:

Jonathan --

Sorry for the delay in replying; thanks for posting again. I'm actually unable to replicate your problem.
:-( I have a new Intel 8-core X5570 box; I'm running at np=6 and np=8 on both Open MPI 1.3.2 and 1.3.3 and am not seeing the problem you're seeing. I even made your sample program worse -- I made a and b 100,000-element real arrays (increasing the count args in MPI_SENDRECV to 100,000 as well), and increased nsteps to 150,000,000.
Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
(only replying to users list)

Some suggestions:

- MPI seems to start up, but the additional TCP connections required for MPI communication seem to be failing / timing out / hitting some other error.

- Are you running firewalls between your machines? If so, can you disable them?

- I see that you're specifying "--mca btl_tcp_port_min_v4 36900" but one of the debug lines reads:

  [apex-backpack:31956] btl: tcp: attempting to connect() to address 10.11.14.203 on port 9360

- Try not using the name "localhost", but rather the IP address of the local machine.

On Sep 22, 2009, at 5:27 PM, Pallab Datta wrote:

The following are the ifconfig outputs for the Mac and the Linux box, respectively:

fuji:openmpi-1.3.3 pallabdatta$ ifconfig
lo0: flags=8049 mtu 16384
	inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
	inet 127.0.0.1 netmask 0xff00
	inet6 ::1 prefixlen 128
gif0: flags=8010 mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8863 mtu 1500
	inet6 fe80::21f:5bff:fe3d:eaac%en0 prefixlen 64 scopeid 0x4
	inet 10.11.14.203 netmask 0xf000 broadcast 10.11.15.255
	ether 00:1f:5b:3d:ea:ac
	media: autoselect (100baseTX) status: active
	supported media: autoselect 10baseT/UTP 10baseT/UTP 10baseT/UTP 10baseT/UTP 100baseTX 100baseTX 100baseTX 100baseTX 1000baseT 1000baseT 1000baseT
en1: flags=8863 mtu 1500
	ether 00:1f:5b:3d:ea:ad
	media: autoselect status: inactive
	supported media: autoselect 10baseT/UTP 10baseT/UTP 10baseT/UTP 10baseT/UTP 100baseTX 100baseTX 100baseTX 100baseTX 1000baseT 1000baseT 1000baseT
fw0: flags=8863 mtu 4078
	lladdr 00:22:41:ff:fe:ed:7d:a8
	media: autoselect status: inactive
	supported media: autoselect

LINUX:

pallabdatta@apex-backpack:~/backpack/src$ ifconfig
lo       Link encap:Local Loopback
         inet addr:127.0.0.1 Mask:255.0.0.0
         inet6 addr: ::1/128 Scope:Host
         UP LOOPBACK RUNNING MTU:16436 Metric:1
         RX packets:116 errors:0 dropped:0 overruns:0 frame:0
         TX packets:116 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:0
         RX bytes:11788 (11.7 KB) TX bytes:11788 (11.7 KB)

wlan0    Link encap:Ethernet HWaddr 00:21:79:c2:54:c7
         inet addr:10.11.14.205 Bcast:10.11.14.255 Mask:255.255.240.0
         inet6 addr: fe80::221:79ff:fec2:54c7/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
         RX packets:72531 errors:0 dropped:0 overruns:0 frame:0
         TX packets:28894 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:5459312 (5.4 MB) TX bytes:7264193 (7.2 MB)

wmaster0 Link encap:UNSPEC HWaddr 00-21-79-C2-54-C7-34-63-00-00-00-00-00-00-00-00
         UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
         RX packets:0 errors:0 dropped:0 overruns:0 frame:0
         TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

The Mac is a two-2.26GHz-quad-core Intel Xeon Mac Pro and the Linux box runs Ubuntu Server Edition 9.04. The Mac connects to the network over its ethernet interface, and the Linux box connects via a wireless adapter (IOGEAR). Please help me with any way I can fix this issue; it really needs to work for our project.

thanks in advance,
regards,
pallab

My other concern was the following, but I am not sure it applies here. If you have multiple interfaces on the node, and they are on the same subnet, then you cannot actually select which IP address to go out of; you can only select the IP address you want to connect to. In these cases, I have seen a hang because we think we are selecting an IP address to go out of, but traffic actually goes out the other one.

Perhaps you can send the users list the output from "ifconfig" on each of the machines, which would show all the interfaces. You need to use the right arguments for ifconfig depending on the OS you are running on.

One thought is to make sure the ethernet interface is marked down on both boxes, if that is possible.

Pallab Datta wrote:
Any suggestions on how to debug this further..?? do you think I need to enable any other option besides heterogeneous at the configure prompt.?

The --enable-heterogeneous should do the trick.
And to answer the previous question, yes, put both of the interfaces in the include list:

  --mca btl_tcp_if_include en0,wlan0

If that
Re: [OMPI users] How to create multi-thread parallel program using thread-safe send and recv?
This is just a test example. The real project behind it needs to be configured like that.

> From: te...@chem.gu.se
> To: us...@open-mpi.org
> Date: Wed, 23 Sep 2009 09:39:22 +1000
> Subject: Re: [OMPI users] How to create multi-thread parallel program using thread-safe send and recv?
>
> If you want all threads to communicate via MPI, and you're initially
> launching multiple parents, I don't really see the advantage of using
> threads at all. Why not launch 12 MPI processes?
>
> On Tue, 2009-09-22 at 10:32 -0700, Eugene Loh wrote:
> > guosong wrote:
> > > Thanks for responding. I used a linux cluster. I think I would like
> > > to create a model that is multithreaded and each thread can make MPI
> > > calls. I attached test code as follows. It has two pthreads and there
> > > are MPI calls in both of those two threads. In the main function,
> > > there are also MPI calls. Should I use full multithreading?
> > I guess so. It seems like the created threads are expected to make
> > independent/concurrent message-passing calls. Do read the link I
> > sent. You need to convert from MPI_Init to MPI_Init_thread(), asking
> > for a full-multithreaded model and checking that you got it. Also
> > note in main() that the MPI_Isend() calls should be matched with
> > MPI_Wait() or similar calls. I guess the parent thread will sit in
> > such calls while the child threads do their own message passing. Good
> > luck.
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] How to create multi-thread parallel program using thread-safe send and recv?
If you want all threads to communicate via MPI, and you're initially launching multiple parents, I don't really see the advantage of using threads at all. Why not launch 12 MPI processes?

On Tue, 2009-09-22 at 10:32 -0700, Eugene Loh wrote:
> guosong wrote:
> > Thanks for responding. I used a linux cluster. I think I would like
> > to create a model that is multithreaded and each thread can make MPI
> > calls. I attached test code as follows. It has two pthreads and there
> > are MPI calls in both of those two threads. In the main function,
> > there are also MPI calls. Should I use full multithreading?
> I guess so. It seems like the created threads are expected to make
> independent/concurrent message-passing calls. Do read the link I
> sent. You need to convert from MPI_Init to MPI_Init_thread(), asking
> for a full-multithreaded model and checking that you got it. Also
> note in main() that the MPI_Isend() calls should be matched with
> MPI_Wait() or similar calls. I guess the parent thread will sit in
> such calls while the child threads do their own message passing. Good
> luck.
[OMPI users] MPI Parent-Child process query
Hi, I am fairly new to MPI. I am just wondering: is it possible for a child process in MPI to communicate with a process that is not its parent? Assistance is much appreciated. Many thanks and best regards, Blesson.
Re: [OMPI users] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
The following are the ifconfig outputs for the Mac and the Linux box, respectively:

fuji:openmpi-1.3.3 pallabdatta$ ifconfig
lo0: flags=8049 mtu 16384
	inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
	inet 127.0.0.1 netmask 0xff00
	inet6 ::1 prefixlen 128
gif0: flags=8010 mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8863 mtu 1500
	inet6 fe80::21f:5bff:fe3d:eaac%en0 prefixlen 64 scopeid 0x4
	inet 10.11.14.203 netmask 0xf000 broadcast 10.11.15.255
	ether 00:1f:5b:3d:ea:ac
	media: autoselect (100baseTX) status: active
	supported media: autoselect 10baseT/UTP 10baseT/UTP 10baseT/UTP 10baseT/UTP 100baseTX 100baseTX 100baseTX 100baseTX 1000baseT 1000baseT 1000baseT
en1: flags=8863 mtu 1500
	ether 00:1f:5b:3d:ea:ad
	media: autoselect status: inactive
	supported media: autoselect 10baseT/UTP 10baseT/UTP 10baseT/UTP 10baseT/UTP 100baseTX 100baseTX 100baseTX 100baseTX 1000baseT 1000baseT 1000baseT
fw0: flags=8863 mtu 4078
	lladdr 00:22:41:ff:fe:ed:7d:a8
	media: autoselect status: inactive
	supported media: autoselect

LINUX:

pallabdatta@apex-backpack:~/backpack/src$ ifconfig
lo       Link encap:Local Loopback
         inet addr:127.0.0.1 Mask:255.0.0.0
         inet6 addr: ::1/128 Scope:Host
         UP LOOPBACK RUNNING MTU:16436 Metric:1
         RX packets:116 errors:0 dropped:0 overruns:0 frame:0
         TX packets:116 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:0
         RX bytes:11788 (11.7 KB) TX bytes:11788 (11.7 KB)

wlan0    Link encap:Ethernet HWaddr 00:21:79:c2:54:c7
         inet addr:10.11.14.205 Bcast:10.11.14.255 Mask:255.255.240.0
         inet6 addr: fe80::221:79ff:fec2:54c7/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
         RX packets:72531 errors:0 dropped:0 overruns:0 frame:0
         TX packets:28894 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:5459312 (5.4 MB) TX bytes:7264193 (7.2 MB)

wmaster0 Link encap:UNSPEC HWaddr 00-21-79-C2-54-C7-34-63-00-00-00-00-00-00-00-00
         UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
         RX packets:0 errors:0 dropped:0 overruns:0 frame:0
         TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

The Mac is a two-2.26GHz-quad-core Intel Xeon Mac Pro and the Linux box runs Ubuntu Server Edition 9.04. The Mac connects to the network over its ethernet interface, and the Linux box connects via a wireless adapter (IOGEAR). Please help me with any way I can fix this issue; it really needs to work for our project.

thanks in advance,
regards,
pallab

> My other concern was the following but I am not sure it applies here.
> If you have multiple interfaces on the node, and they are on the same
> subnet, then you cannot actually select which IP address to go out of.
> You can only select the IP address you want to connect to. In these
> cases, I have seen a hang because we think we are selecting an IP
> address to go out of, but it actually goes out the other one.
> Perhaps you can send the users list the output from "ifconfig" on each
> of the machines, which would show all the interfaces. You need to use the
> right arguments for ifconfig depending on the OS you are running on.
>
> One thought is to make sure the ethernet interface is marked down on both
> boxes if that is possible.
>
> Pallab Datta wrote:
>> Any suggestions on how to debug this further..??
>> do you think I need to enable any other option besides heterogeneous at
>> the configure prompt.?
>>
>>> The --enable-heterogeneous should do the trick. And to answer the
>>> previous question, yes, put both of the interfaces in the include list.
>>>
>>> --mca btl_tcp_if_include en0,wlan0
>>>
>>> If that does not work, then I may have one other thought why it might
>>> not work, although perhaps not a solution.
>>>
>>> Rolf
>>>
>>> Pallab Datta wrote:
>>> Hi Rolf,
>>> Do i need to configure openmpi with some specific options apart from
>>> --enable-heterogeneous..?
>>> I am currently using
>>> ./configure --prefix=/usr/local/ --enable-heterogeneous --disable-static
>>> --enable-shared --enable-debug
>>> on both ends...is the above correct..?!
Please let me know. thanks and regards, pallab > Hi: > I
Re: [OMPI users] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
Is this a bug running open-mpi over heterogeneous environments (between a mac and linux) over wireless links. Please suggest what needs to be done or what I am missing.?! Any clues as to how to debug this will be of great help. thanks and regards, pallab > Hi Rolf, > > I ran the following: > > pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca > btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca > btl_tcp_if_include en0,wlan0 -np 2 -hetero -H localhost,10.11.14.205 > /tmp/hello > > [fuji.local:02267] mca: base: components_open: Looking for btl components > [fuji.local:02267] mca: base: components_open: opening btl components > [fuji.local:02267] mca: base: components_open: found loaded component self > [fuji.local:02267] mca: base: components_open: component self has no > register function > [fuji.local:02267] mca: base: components_open: component self open > function successful > [fuji.local:02267] mca: base: components_open: found loaded component sm > [fuji.local:02267] mca: base: components_open: component sm has no > register function > [fuji.local:02267] mca: base: components_open: component sm open function > successful > [fuji.local:02267] mca: base: components_open: found loaded component tcp > [fuji.local:02267] mca: base: components_open: component tcp has no > register function > [fuji.local:02267] mca: base: components_open: component tcp open function > successful > [fuji.local:02267] select: initializing btl component self > [fuji.local:02267] select: init of component self returned success > [fuji.local:02267] select: initializing btl component sm > [fuji.local:02267] select: init of component sm returned success > [fuji.local:02267] select: initializing btl component tcp > [fuji.local][[59424,1],0][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances] > invalid interface "wlan0" > [fuji.local:02267] select: init of component tcp returned success > [apex-backpack:31956] mca: base: components_open: Looking for btl 
> components > [apex-backpack:31956] mca: base: components_open: opening btl components > [apex-backpack:31956] mca: base: components_open: found loaded component > self > [apex-backpack:31956] mca: base: components_open: component self has no > register function > [apex-backpack:31956] mca: base: components_open: component self open > function successful > [apex-backpack:31956] mca: base: components_open: found loaded component > sm > [apex-backpack:31956] mca: base: components_open: component sm has no > register function > [apex-backpack:31956] mca: base: components_open: component sm open > function successful > [apex-backpack:31956] mca: base: components_open: found loaded component > tcp > [apex-backpack:31956] mca: base: components_open: component tcp has no > register function > [apex-backpack:31956] mca: base: components_open: component tcp open > function successful > [apex-backpack:31956] select: initializing btl component self > [apex-backpack:31956] select: init of component self returned success > [apex-backpack:31956] select: initializing btl component sm > [apex-backpack:31956] select: init of component sm returned success > [apex-backpack:31956] select: initializing btl component tcp > [apex-backpack][[59424,1],1][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances] > invalid interface "en0" > [apex-backpack:31956] select: init of component tcp returned success > Process 0 on fuji.local out of 2 > Process 1 on apex-backpack out of 2 > [apex-backpack:31956] btl: tcp: attempting to connect() to address > 10.11.14.203 on port 9360 > > > > It launches the processes on both ends and then it hangs at the send > receive part..!! > What is the other thing that you were mentioning which makes you think > that its not working?!? > Please suggest.. > --regards, pallab > > > >> The -enable-heterogeneous should do the trick. And to answer the >> previous question, yes, put both of the interfaces in the include list. 
>> >> --mca btl_tcp_if_include en0,wlan0 >> >> If that does not work, then I may have one other thought why it might >> not work although perhaps not a solution. >> >> Rolf >> >> Pallab Datta wrote: >>> Hi Rolf, >>> >>> Do i need to configure openmpi with some specific options apart from >>> --enable-heterogeneous..? >>> I am currently using >>> ./configure --prefix=/usr/local/ --enable-heterogeneous >>> --disable-static >>> --enable-shared --enable-debug >>> >>> on both ends...is the above correct..?! Please let me know. >>> thanks and regards, >>> pallab >>> >>> Hi: I assume if you wait several minutes than your program will actually time out, yes? I guess I have two suggestions. First, can you run a non-MPI job using the wireless? Something like hostname? Secondly, you may want to specify the specific interfaces you want it to use on the two machines. You can do that via the "--mca btl_tcp_if_include" run-time parameter. Just list the ones that you expect it to use.
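Putting Rolf's pieces together, the invocation from this thread would look something like the sketch below. This is illustrative, not a tested command: the IPs and interface names are the ones reported earlier in the thread, and (as the verbose logs in this thread show) each host warns about the interface name it doesn't have and then carries on.

```shell
# Interfaces as reported by ifconfig on each box (from this thread):
MAC_IF=en0       # the Mac's wired interface
LINUX_IF=wlan0   # the Linux box's wireless interface

# Both names go in one include list; each host only uses the name it has.
CMD="mpirun --mca btl tcp,self --mca btl_tcp_if_include ${MAC_IF},${LINUX_IF} -np 2 -H 10.11.14.203,10.11.14.205 /tmp/hello"
echo "$CMD"
```

Using the IP addresses in -H (rather than "localhost") also follows Jeff's earlier suggestion, since "localhost" resolves differently on each machine.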
Re: [OMPI users] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
Hi Rolf, I ran the following: pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca btl_tcp_if_include en0,wlan0 -np 2 -hetero -H localhost,10.11.14.205 /tmp/hello [fuji.local:02267] mca: base: components_open: Looking for btl components [fuji.local:02267] mca: base: components_open: opening btl components [fuji.local:02267] mca: base: components_open: found loaded component self [fuji.local:02267] mca: base: components_open: component self has no register function [fuji.local:02267] mca: base: components_open: component self open function successful [fuji.local:02267] mca: base: components_open: found loaded component sm [fuji.local:02267] mca: base: components_open: component sm has no register function [fuji.local:02267] mca: base: components_open: component sm open function successful [fuji.local:02267] mca: base: components_open: found loaded component tcp [fuji.local:02267] mca: base: components_open: component tcp has no register function [fuji.local:02267] mca: base: components_open: component tcp open function successful [fuji.local:02267] select: initializing btl component self [fuji.local:02267] select: init of component self returned success [fuji.local:02267] select: initializing btl component sm [fuji.local:02267] select: init of component sm returned success [fuji.local:02267] select: initializing btl component tcp [fuji.local][[59424,1],0][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances] invalid interface "wlan0" [fuji.local:02267] select: init of component tcp returned success [apex-backpack:31956] mca: base: components_open: Looking for btl components [apex-backpack:31956] mca: base: components_open: opening btl components [apex-backpack:31956] mca: base: components_open: found loaded component self [apex-backpack:31956] mca: base: components_open: component self has no register function [apex-backpack:31956] mca: base: components_open: component self open 
function successful [apex-backpack:31956] mca: base: components_open: found loaded component sm [apex-backpack:31956] mca: base: components_open: component sm has no register function [apex-backpack:31956] mca: base: components_open: component sm open function successful [apex-backpack:31956] mca: base: components_open: found loaded component tcp [apex-backpack:31956] mca: base: components_open: component tcp has no register function [apex-backpack:31956] mca: base: components_open: component tcp open function successful [apex-backpack:31956] select: initializing btl component self [apex-backpack:31956] select: init of component self returned success [apex-backpack:31956] select: initializing btl component sm [apex-backpack:31956] select: init of component sm returned success [apex-backpack:31956] select: initializing btl component tcp [apex-backpack][[59424,1],1][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances] invalid interface "en0" [apex-backpack:31956] select: init of component tcp returned success Process 0 on fuji.local out of 2 Process 1 on apex-backpack out of 2 [apex-backpack:31956] btl: tcp: attempting to connect() to address 10.11.14.203 on port 9360 It launches the processes on both ends and then it hangs at the send receive part..!! What is the other thing that you were mentioning which makes you think that its not working?!? Please suggest.. --regards, pallab > The -enable-heterogeneous should do the trick. And to answer the > previous question, yes, put both of the interfaces in the include list. > > --mca btl_tcp_if_include en0,wlan0 > > If that does not work, then I may have one other thought why it might > not work although perhaps not a solution. > > Rolf > > Pallab Datta wrote: >> Hi Rolf, >> >> Do i need to configure openmpi with some specific options apart from >> --enable-heterogeneous..? 
>> I am currently using >> ./configure --prefix=/usr/local/ --enable-heterogeneous --disable-static >> --enable-shared --enable-debug >> >> on both ends...is the above correct..?! Please let me know. >> thanks and regards, >> pallab >> >> >>> Hi: >>> I assume if you wait several minutes than your program will actually >>> time out, yes? I guess I have two suggestions. First, can you run a >>> non-MPI job using the wireless? Something like hostname? Secondly, >>> you >>> may want to specify the specific interfaces you want it to use on the >>> two machines. You can do that via the "--mca btl_tcp_if_include" >>> run-time parameter. Just list the ones that you expect it to use. >>> >>> Also, this is not right - "--mca OMPI_mca_mpi_preconnect_all 1" It >>> should be --mca mpi_preconnect_mpi 1 if you want to do the connection >>> during MPI_Init. >>> >>> Rolf >>> >>> Pallab Datta wrote: >>> The following is the error dump fuji:src pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca btl tcp,self --mca
Re: [OMPI users] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
Hi Rolf, Thanks for the suggestions. I will try it. I can run a non-mpi program over wireless. My mac's ethernet interface is en0, and my linux's wireless is wlan0..can I mention both in the --mca btl__tcp_if_include option?! thanks a lot in advance, regards, pallab > Hi: > I assume if you wait several minutes than your program will actually > time out, yes? I guess I have two suggestions. First, can you run a > non-MPI job using the wireless? Something like hostname? Secondly, you > may want to specify the specific interfaces you want it to use on the > two machines. You can do that via the "--mca btl_tcp_if_include" > run-time parameter. Just list the ones that you expect it to use. > > Also, this is not right - "--mca OMPI_mca_mpi_preconnect_all 1" It > should be --mca mpi_preconnect_mpi 1 if you want to do the connection > during MPI_Init. > > Rolf >> > > > Pallab Datta wrote: >> The following is the error dump >> >> fuji:src pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 >> 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca btl >> tcp,self --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H >> localhost,10.11.14.205 /tmp/hello >> [fuji.local:01316] mca: base: components_open: Looking for btl >> components >> [fuji.local:01316] mca: base: components_open: opening btl components >> [fuji.local:01316] mca: base: components_open: found loaded component >> self >> [fuji.local:01316] mca: base: components_open: component self has no >> register function >> [fuji.local:01316] mca: base: components_open: component self open >> function successful >> [fuji.local:01316] mca: base: components_open: found loaded component >> tcp >> [fuji.local:01316] mca: base: components_open: component tcp has no >> register function >> [fuji.local:01316] mca: base: components_open: component tcp open >> function >> successful >> [fuji.local:01316] select: initializing btl component self >> [fuji.local:01316] select: init of component self returned 
success >> [fuji.local:01316] select: initializing btl component tcp >> [fuji.local:01316] select: init of component tcp returned success >> [apex-backpack:04753] mca: base: components_open: Looking for btl >> components >> [apex-backpack:04753] mca: base: components_open: opening btl components >> [apex-backpack:04753] mca: base: components_open: found loaded component >> self >> [apex-backpack:04753] mca: base: components_open: component self has no >> register function >> [apex-backpack:04753] mca: base: components_open: component self open >> function successful >> [apex-backpack:04753] mca: base: components_open: found loaded component >> tcp >> [apex-backpack:04753] mca: base: components_open: component tcp has no >> register function >> [apex-backpack:04753] mca: base: components_open: component tcp open >> function successful >> [apex-backpack:04753] select: initializing btl component self >> [apex-backpack:04753] select: init of component self returned success >> [apex-backpack:04753] select: initializing btl component tcp >> [apex-backpack:04753] select: init of component tcp returned success >> Process 0 on fuji.local out of 2 >> Process 1 on apex-backpack out of 2 >> [apex-backpack:04753] btl: tcp: attempting to connect() to address >> 10.11.14.203 on port 9360 >> >> >> >> >> >>> Hi >>> >>> I am trying to run open-mpi 1.3.3. between a linux box running ubuntu >>> server v.9.04 and a Macintosh. I have configured openmpi with the >>> following options.: >>> ./configure --prefix=/usr/local/ --enable-heterogeneous >>> --disable-shared >>> --enable-static >>> >>> When both the machines are connected to the network via ethernet cables >>> openmpi works fine. >>> >>> But when I switch the linux box to a wireless adapter i can reach >>> (ping) >>> the macintosh >>> but openmpi hangs on a hello world program. 
>>> >>> I ran : >>> >>> /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca >>> btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca >>> OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205 >>> /tmp/back >>> >>> it hangs on a send receive function between the two ends. All my >>> firewalls >>> are turned off at the macintosh end. PLEASE HELP ASAP> >>> regards, >>> pallab >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > > > -- > > = > rolf.vandeva...@sun.com > 781-442-3043 > = > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] How to create multi-thread parallel program using thread-safe send and recv?
Thanks for responding. I'm using a Linux cluster. I think I want a model where the processes are multithreaded and each thread can make MPI calls. I attached my test code below. It has two pthreads, and there are MPI calls in both of those threads; in the main function there are also MPI calls. Should I use full multithreading? Thanks again.

#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <cstdio>
#include <cstdlib>
#include <pthread.h>
#include "mpi.h"

using namespace std;

pthread_mutex_t _dealmutex;
pthread_mutex_t _dealmutex1;
pthread_mutex_t _dealmutex2;

void* backID(void* arg)
{
    int myid;
    pthread_mutex_init(&_dealmutex1, NULL);
    stringstream RANK;
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    RANK << myid;
    cout << myid << " create background ID" << endl;
    MPI_Status status;
    MPI_Request requ1;
    int m;
    int count = 0;
    string filename("f_");
    filename += RANK.str();
    filename += "_backID.txt";
    fstream fout(filename.c_str(), ios::out);
    if (!fout) {
        cout << "can not create the file " << filename << endl;
        fout.close();
        exit(1);
    }
    while (true) {
        MPI_Irecv(&m, 1, MPI_INT, MPI_ANY_SOURCE, 222, MPI_COMM_WORLD, &requ1);
        MPI_Wait(&requ1, &status);
        fout << "BACKID_REV:" << myid << " recv from " << status.MPI_SOURCE
             << " m = " << m << " with tag 222" << endl;
        fout.flush();
        MPI_Send(&m, 1, MPI_INT, status.MPI_SOURCE, 333, MPI_COMM_WORLD);
        fout << "BACKID_SEND:" << myid << " replies " << status.MPI_SOURCE
             << " m = " << m << endl;
        fout.flush();
        count++;
        if (count == 50) {
            fout << "***backID FINISHED IN " << myid << endl;
            fout.flush();
            fout.close();
            pthread_exit(NULL);
            return 0;
        }
    }
}

void* backRecv(void* arg)
{
    int myid;
    pthread_mutex_init(&_dealmutex2, NULL);
    stringstream RANK;
    MPI_Status status;
    MPI_Request requ2;
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    RANK << myid;
    cout << myid << " create background message recv" << endl;
    int m;
    int count = 0;
    string filename("f_");
    filename += RANK.str();
    filename += "_backRecv.txt";
    fstream fout(filename.c_str(), ios::out);
    if (!fout) {
        cout << "can not create the file " << filename << endl;
        fout.close();
        exit(1);
    }
    while (true) {
        MPI_Irecv(&m, 1, MPI_INT, MPI_ANY_SOURCE, 333, MPI_COMM_WORLD, &requ2);
        MPI_Wait(&requ2, &status);
        fout << "BACKREV:" << myid << " recv from " << status.MPI_SOURCE
             << " m = " << m << " with tag 333" << endl;
        fout.flush();
        count++;
        if (count == 50) {
            fout << "***backRecv FINISHED IN " << myid << endl;
            fout.flush();
            fout.close();
            pthread_exit(NULL);
            return 0;
        }
    }
}

int main(int argc, char **argv)
{
    int myid = 0;
    int nprocs = 0;
    pthread_t pt1 = 0;
    pthread_t pt2 = 0;
    int pret1 = 0;
    int pret2 = 0;
    int i = 0, t = 0;
    MPI_Request requ1;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    pthread_mutex_init(&_dealmutex, NULL);

    for (i = 0; i < 50; ++i) {
        t = (myid + 1) * i;
        MPI_Isend(&t, 1, MPI_INT, (myid + 1) % nprocs, 222, MPI_COMM_WORLD, &requ1);
        cout << "MAIN:" << myid << " sends to " << (myid + 1) % nprocs
             << " " << myid << endl;
        fflush(stdout);
    }

    pret1 = pthread_create(&pt1, NULL, backRecv, NULL);
    if (pret1 != 0) {
        cout << myid << " backRecv Thread Create Failed." << endl;
        exit(1);
    }
    pret2 = pthread_create(&pt2, NULL, backID, NULL);
    if (pret2 != 0) {
        cout << myid << " backID Thread Create Failed." << endl;
        exit(1);
    }

    pthread_join(pt2, NULL);
    cout << "***THREAD 2 SUCCESS!" << endl;
    pthread_join(pt1, NULL);
    cout << "***THREAD 1 SUCCESS!" << endl;
    MPI_Finalize();
    cout << "***MAIN SUCCESS!" << endl;
    return 0;
}
Re: [OMPI users] How to create multi-thread parallel program using thread-safe send and recv?
guosong wrote: Hi all, I would like to write a multi-threaded parallel program. I used pthreads. Basically, I want to create two background threads besides the main thread (process). For example, if I use "-np 4", the program should have 4 main processes on four processors and two background threads for each main process. So there should be 8 threads in total. Wouldn't there be 4 main threads and 8 "slave" threads for a total of 12 threads? Anyhow, it doesn't matter. I'm not sure where you're starting, but you should at least have a basic understanding of the different sorts of multithreaded programming models in MPI. One is that each process is single threaded. Another is that the processes are multithreaded, but only the main thread makes MPI calls. Another is multithreaded, but only one MPI call at a time. Finally, there can be full multithreading. You have to decide which of these programming models you want and which is supported by your MPI (or, if OMPI, how OMPI was built). For more information, try the MPI_Init_thread() man page or http://www.mpi-forum.org/docs/mpi21-report.pdf ... see Section 12.4 on "MPI and Threads". I wrote a test program and it worked unpredictably. Sometimes I got the result I want, but sometimes the program got a segmentation fault. I used MPI_Isend and MPI_Irecv for sending and receiving. I do not know why.
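To make the "full multithreading" model concrete: the thread level is requested with MPI_Init_thread() instead of MPI_Init(), and the library reports back what it actually provides, which may be less than requested. A minimal sketch (not the poster's code; it assumes an MPI library built with thread support, and must be compiled with mpicxx and launched with mpirun):

```cpp
// Sketch: request MPI_THREAD_MULTIPLE so any thread may make MPI calls.
#include <cstdio>
#include "mpi.h"

int main(int argc, char **argv)
{
    int provided;
    // Ask for the highest thread level; the library may grant less.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    // Always check 'provided' before letting worker threads call MPI.
    if (provided < MPI_THREAD_MULTIPLE) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            fprintf(stderr,
                    "MPI_THREAD_MULTIPLE not available (provided=%d); "
                    "threads must not make concurrent MPI calls.\n",
                    provided);
    }

    MPI_Finalize();
    return 0;
}
```

If 'provided' comes back as MPI_THREAD_SINGLE or MPI_THREAD_FUNNELED, concurrent MPI calls from pthreads (as in the attached test program) can crash exactly the way described here.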
I attached the error message as follows:
[cheetah:29780] *** Process received signal ***
[cheetah:29780] Signal: Segmentation fault (11)
[cheetah:29780] Signal code: Address not mapped (1)
[cheetah:29780] Failing at address: 0x10
[cheetah:29779] *** Process received signal ***
[cheetah:29779] Signal: Segmentation fault (11)
[cheetah:29779] Signal code: Address not mapped (1)
[cheetah:29779] Failing at address: 0x10
[cheetah:29780] [ 0] /lib64/libpthread.so.0 [0x334b00de70]
[cheetah:29780] [ 1] /act/openmpi/gnu/lib/openmpi/mca_btl_sm.so [0x2b90e1227940]
[cheetah:29780] [ 2] /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so [0x2b90e05d61ca]
[cheetah:29780] [ 3] /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so [0x2b90e05cac86]
[cheetah:29780] [ 4] /act/openmpi/gnu/lib/libmpi.so.0(PMPI_Send+0x13d) [0x2b90dde7271d]
[cheetah:29780] [ 5] pt_muti(_Z6backIDPv+0x29b) [0x409929]
[cheetah:29780] [ 6] /lib64/libpthread.so.0 [0x334b0062f7]
[cheetah:29780] [ 7] /lib64/libc.so.6(clone+0x6d) [0x334a4d1e3d]
[cheetah:29780] *** End of error message ***
[cheetah:29779] [ 0] /lib64/libpthread.so.0 [0x334b00de70]
[cheetah:29779] [ 1] /act/openmpi/gnu/lib/openmpi/mca_btl_sm.so [0x2b39785c0940]
[cheetah:29779] [ 2] /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so [0x2b397796f1ca]
[cheetah:29779] [ 3] /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so [0x2b3977963c86]
[cheetah:29779] [ 4] /act/openmpi/gnu/lib/libmpi.so.0(PMPI_Send+0x13d) [0x2b397520b71d]
[cheetah:29779] [ 5] pt_muti(_Z6backIDPv+0x29b) [0x409929]
[cheetah:29779] [ 6] /lib64/libpthread.so.0 [0x334b0062f7]
[cheetah:29779] [ 7] /lib64/libc.so.6(clone+0x6d) [0x334a4d1e3d]
[cheetah:29779] *** End of error message ***
I used gdb to "bt" the error and I got:
Program terminated with signal 11, Segmentation fault.
#0 0x2b90e1227940 in mca_btl_sm_alloc () from /act/openmpi/gnu/lib/openmpi/mca_btl_sm.so (gdb) bt #0 0x2b90e1227940 in mca_btl_sm_alloc () from /act/openmpi/gnu/lib/openmpi/mca_btl_sm.so #1 0x2b90e05d61ca in mca_pml_ob1_send_request_start_copy () from /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so #2 0x2b90e05cac86 in mca_pml_ob1_send () from /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so #3 0x2b90dde7271d in PMPI_Send () from /act/openmpi/gnu/lib/libmpi.so.0 #4 0x00409929 in backID (arg=0x0) at pt_muti.cpp:50 #5 0x00334b0062f7 in start_thread () from /lib64/libpthread.so.0 #6 0x00334a4d1e3d in clone () from /lib64/libc.so.6 So can anyone give me some suggestions or advice. Thanks very much.
Re: [OMPI users] MPI_Irecv segmentation fault
Did you also change the "&buffer" to buffer in your MPI_Send call? Jody

On Tue, Sep 22, 2009 at 1:38 PM, Everette Clemmer wrote:
> Hmm, tried changing MPI_Irecv( &buffer... ) to MPI_Irecv( buffer... )
> and still no luck. Stack trace follows if that's helpful:
>
> prompt$ mpirun -np 2 ./display_test_debug
> Sending 'q' from node 0 to node 1
> [COMPUTER:50898] *** Process received signal ***
> [COMPUTER:50898] Signal: Segmentation fault (11)
> [COMPUTER:50898] Signal code: (0)
> [COMPUTER:50898] Failing at address: 0x0
> [COMPUTER:50898] [ 0] 2 libSystem.B.dylib 0x7fff87e280aa _sigtramp + 26
> [COMPUTER:50898] [ 1] 3 ??? 0x 0x0 + 0
> [COMPUTER:50898] [ 2] 4 GLUT 0x000100024a21 glutMainLoop + 261
> [COMPUTER:50898] [ 3] 5 display_test_debug 0x00011444 xsMainLoop + 67
> [COMPUTER:50898] [ 4] 6 display_test_debug 0x00011335 main + 59
> [COMPUTER:50898] [ 5] 7 display_test_debug 0x00010d9c start + 52
> [COMPUTER:50898] [ 6] 8 ??? 0x0001 0x0 + 1
> [COMPUTER:50898] *** End of error message ***
> mpirun noticed that job rank 0 with PID 50897 on node COMPUTER.local
> exited on signal 15 (Terminated).
> 1 additional process aborted (not shown)
>
> Thanks,
> Everette
>
> On Tue, Sep 22, 2009 at 2:28 AM, Ake Sandgren
> wrote:
>> On Mon, 2009-09-21 at 19:26 -0400, Everette Clemmer wrote:
>>> Hey all,
>>>
>>> I'm getting a segmentation fault when I attempt to receive a single
>>> character via MPI_Irecv. Code follows:
>>>
>>> void recv_func() {
>>>     if( !MASTER ) {
>>>         char buffer[ 1 ];
>>>         int flag;
>>>         MPI_Request request;
>>>         MPI_Status status;
>>>
>>>         MPI_Irecv( &buffer, 1, MPI_CHAR, 0, MPI_ANY_TAG,
>>>                    MPI_COMM_WORLD, &request );
>>
>> It should be MPI_Irecv(buffer, 1, ...)
>>
>>> The segfault disappears if I comment out the MPI_Irecv call in
>>> recv_func so I'm assuming that there's something wrong with the
>>> parameters that I'm passing to it. Thoughts?
>> >> -- >> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden >> Internet: a...@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90 7866126 >> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > > > > -- > - Everette > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
[OMPI users] MPI_Comm_spawn query
Hi, I am fairly new to MPI. I have a few queries regarding spawning processes that I am listing below: a. How can processes send data to a spawned process? b. Can any process (that is not a parent process) send data to a spawned process? c. Can MPI_Send or MPI_Recv be used to communicate with a spawned process? d. Would it be possible in MPI to specify on which node of a cluster a process should be spawned? Looking forward to your reply. Would much appreciate it if you could please include code snippets for the same. Many thanks and best regards, Blesson.
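A sketch touching on (a)-(d): MPI_Comm_spawn returns an intercommunicator, ordinary MPI_Send/MPI_Recv work across it (answering a and c), and an MPI_Info "host" key can request placement (d). The executable name ./spawned_app and the host name node01 below are made-up placeholders, and the same binary plays both parent and child roles:

```cpp
// Sketch: spawn two children, pin them to a host, send one int to child 0.
#include <cstdio>
#include "mpi.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);   // MPI_COMM_NULL unless we were spawned

    if (parent == MPI_COMM_NULL) {
        // (d) Request a node for the children via an MPI_Info "host" key.
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "node01");      // hypothetical hostname

        MPI_Comm children;                          // intercommunicator
        MPI_Comm_spawn("./spawned_app", MPI_ARGV_NULL, 2, info,
                       0, MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);
        MPI_Info_free(&info);

        // (a)+(c) Plain MPI_Send works; rank 0 here means child rank 0
        // in the remote group of the intercommunicator.
        int msg = 42;
        MPI_Send(&msg, 1, MPI_INT, 0, 0, children);
    } else {
        // Child side: receive from parent rank 0 over the parent intercomm.
        int msg;
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
        printf("child received %d\n", msg);
    }

    MPI_Finalize();
    return 0;
}
```

For (b): a process that is not the parent can still talk to the children, e.g. by merging the intercommunicator into a normal intracommunicator with MPI_Intercomm_merge, since MPI_Comm_spawn is collective over the communicator passed to it.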
Re: [OMPI users] MPI_Irecv segmentation fault
Hmm, tried changing MPI_Irecv( &buffer... ) to MPI_Irecv( buffer... ) and still no luck. Stack trace follows if that's helpful:

prompt$ mpirun -np 2 ./display_test_debug
Sending 'q' from node 0 to node 1
[COMPUTER:50898] *** Process received signal ***
[COMPUTER:50898] Signal: Segmentation fault (11)
[COMPUTER:50898] Signal code: (0)
[COMPUTER:50898] Failing at address: 0x0
[COMPUTER:50898] [ 0] 2 libSystem.B.dylib 0x7fff87e280aa _sigtramp + 26
[COMPUTER:50898] [ 1] 3 ??? 0x 0x0 + 0
[COMPUTER:50898] [ 2] 4 GLUT 0x000100024a21 glutMainLoop + 261
[COMPUTER:50898] [ 3] 5 display_test_debug 0x00011444 xsMainLoop + 67
[COMPUTER:50898] [ 4] 6 display_test_debug 0x00011335 main + 59
[COMPUTER:50898] [ 5] 7 display_test_debug 0x00010d9c start + 52
[COMPUTER:50898] [ 6] 8 ??? 0x0001 0x0 + 1
[COMPUTER:50898] *** End of error message ***
mpirun noticed that job rank 0 with PID 50897 on node COMPUTER.local
exited on signal 15 (Terminated).
1 additional process aborted (not shown)

Thanks,
Everette

On Tue, Sep 22, 2009 at 2:28 AM, Ake Sandgren wrote:
> On Mon, 2009-09-21 at 19:26 -0400, Everette Clemmer wrote:
>> Hey all,
>>
>> I'm getting a segmentation fault when I attempt to receive a single
>> character via MPI_Irecv. Code follows:
>>
>> void recv_func() {
>>     if( !MASTER ) {
>>         char buffer[ 1 ];
>>         int flag;
>>         MPI_Request request;
>>         MPI_Status status;
>>
>>         MPI_Irecv( &buffer, 1, MPI_CHAR, 0, MPI_ANY_TAG,
>>                    MPI_COMM_WORLD, &request );
>
> It should be MPI_Irecv(buffer, 1, ...)
>
>> The segfault disappears if I comment out the MPI_Irecv call in
>> recv_func so I'm assuming that there's something wrong with the
>> parameters that I'm passing to it. Thoughts?
>
> --
> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
> Internet: a...@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90 7866126
> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
--
- Everette
Re: [OMPI users] MPI_Irecv segmentation fault
On Mon, 2009-09-21 at 19:26 -0400, Everette Clemmer wrote:
> Hey all,
>
> I'm getting a segmentation fault when I attempt to receive a single
> character via MPI_Irecv. Code follows:
>
> void recv_func() {
>     if( !MASTER ) {
>         char buffer[ 1 ];
>         int flag;
>         MPI_Request request;
>         MPI_Status status;
>
>         MPI_Irecv( &buffer, 1, MPI_CHAR, 0, MPI_ANY_TAG,
>                    MPI_COMM_WORLD, &request );

It should be MPI_Irecv(buffer, 1, ...)

> The segfault disappears if I comment out the MPI_Irecv call in
> recv_func so I'm assuming that there's something wrong with the
> parameters that I'm passing to it. Thoughts?

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
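For reference, a corrected form of the fragment (a sketch, with the MASTER macro replaced by an explicit rank parameter). Note that besides the buffer argument, the receive buffer and request must remain valid until the request is completed, so a stack buffer needs an MPI_Wait (or MPI_Test loop) before the function returns:

```cpp
// Sketch: non-blocking receive of one char, completed before the
// stack buffer goes out of scope.
#include "mpi.h"

void recv_func(int rank)
{
    if (rank != 0) {                 // "!MASTER" in the original code
        char buffer[1];
        MPI_Request request;
        MPI_Status status;

        // Pass the array itself (it decays to char*), and the
        // address of the request.
        MPI_Irecv(buffer, 1, MPI_CHAR, 0, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &request);

        // Complete the request while 'buffer' is still alive;
        // returning first would leave MPI writing into freed stack.
        MPI_Wait(&request, &status);
    }
}
```

(For a char array, &buffer and buffer are the same address, so the buffer-lifetime issue is the more likely culprit when the segfault persists after that change.)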
[OMPI users] error in ompi-checkpoint
[root@localhost examples]# mpirun -np 4 -am ft-enable-cr ./res 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 -- Error: The process with PID 19735 is not checkpointable. This could be due to one of the following: - An application with this PID doesn't currently exist - The application with this PID isn't checkpointable - The application with this PID isn't an OPAL application. We were looking for the named files: /tmp/opal_cr_prog_write.19735 /tmp/opal_cr_prog_read.19735 -- [localhost.localdomain:19733] local) Error: Unable to initiate the handshake with peer [[17893,1],1]. -1 [localhost.localdomain:19733] [[17893,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 567 [localhost.localdomain:19733] [[17893,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 Note: pid of mpirun is 19733
[OMPI users] How to create multi-thread parallel program using thread-safe send and recv?
Hi all, I would like to write a multi-threaded parallel program. I used pthreads. Basically, I want to create two background threads besides the main thread (process). For example, if I use "-np 4", the program should have 4 main processes on four processors and two background threads for each main process. So there should be 8 threads in total. I wrote a test program and it worked unpredictably. Sometimes I got the result I want, but sometimes the program got a segmentation fault. I used MPI_Isend and MPI_Irecv for sending and receiving. I do not know why. I attached the error message as follows:
[cheetah:29780] *** Process received signal ***
[cheetah:29780] Signal: Segmentation fault (11)
[cheetah:29780] Signal code: Address not mapped (1)
[cheetah:29780] Failing at address: 0x10
[cheetah:29779] *** Process received signal ***
[cheetah:29779] Signal: Segmentation fault (11)
[cheetah:29779] Signal code: Address not mapped (1)
[cheetah:29779] Failing at address: 0x10
[cheetah:29780] [ 0] /lib64/libpthread.so.0 [0x334b00de70]
[cheetah:29780] [ 1] /act/openmpi/gnu/lib/openmpi/mca_btl_sm.so [0x2b90e1227940]
[cheetah:29780] [ 2] /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so [0x2b90e05d61ca]
[cheetah:29780] [ 3] /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so [0x2b90e05cac86]
[cheetah:29780] [ 4] /act/openmpi/gnu/lib/libmpi.so.0(PMPI_Send+0x13d) [0x2b90dde7271d]
[cheetah:29780] [ 5] pt_muti(_Z6backIDPv+0x29b) [0x409929]
[cheetah:29780] [ 6] /lib64/libpthread.so.0 [0x334b0062f7]
[cheetah:29780] [ 7] /lib64/libc.so.6(clone+0x6d) [0x334a4d1e3d]
[cheetah:29780] *** End of error message ***
[cheetah:29779] [ 0] /lib64/libpthread.so.0 [0x334b00de70]
[cheetah:29779] [ 1] /act/openmpi/gnu/lib/openmpi/mca_btl_sm.so [0x2b39785c0940]
[cheetah:29779] [ 2] /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so [0x2b397796f1ca]
[cheetah:29779] [ 3] /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so [0x2b3977963c86]
[cheetah:29779] [ 4] /act/openmpi/gnu/lib/libmpi.so.0(PMPI_Send+0x13d) [0x2b397520b71d]
[cheetah:29779]
[ 5] pt_muti(_Z6backIDPv+0x29b) [0x409929]
[cheetah:29779] [ 6] /lib64/libpthread.so.0 [0x334b0062f7]
[cheetah:29779] [ 7] /lib64/libc.so.6(clone+0x6d) [0x334a4d1e3d]
[cheetah:29779] *** End of error message ***
I used gdb to "bt" the error and I got:
Program terminated with signal 11, Segmentation fault.
#0 0x2b90e1227940 in mca_btl_sm_alloc () from /act/openmpi/gnu/lib/openmpi/mca_btl_sm.so
(gdb) bt
#0 0x2b90e1227940 in mca_btl_sm_alloc () from /act/openmpi/gnu/lib/openmpi/mca_btl_sm.so
#1 0x2b90e05d61ca in mca_pml_ob1_send_request_start_copy () from /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so
#2 0x2b90e05cac86 in mca_pml_ob1_send () from /act/openmpi/gnu/lib/openmpi/mca_pml_ob1.so
#3 0x2b90dde7271d in PMPI_Send () from /act/openmpi/gnu/lib/libmpi.so.0
#4 0x00409929 in backID (arg=0x0) at pt_muti.cpp:50
#5 0x00334b0062f7 in start_thread () from /lib64/libpthread.so.0
#6 0x00334a4d1e3d in clone () from /lib64/libc.so.6
So can anyone give me some suggestions or advice. Thanks very much.