Re: [OMPI users] OpenMPI 1.4 tuning for sending large messages
Hello,

Are you using a heterogeneous environment? There was a similar issue recently with a segfault in a mixed x86 and x86_64 environment. Here is the corresponding thread on ompi-devel:
http://www.open-mpi.org/community/lists/devel/2010/04/7787.php
This was fixed in trunk and will likely be fixed in the next 1.4 release. You can download the latest trunk snapshot from http://www.open-mpi.org/nightly/trunk/ and test it.

On Mon, 26/04/2010 at 15:28 -0400, Pooja Varshneya wrote:
> Hi All,
>
> I am using OpenMPI 1.4 on a cluster of Intel quad-core processors
> running Linux and connected by Ethernet.
>
> In an application, I am trying to send and receive large messages of
> sizes ranging from 1 KB up to 500 MB.
> The application works fine if the message sizes are within the 1 MB
> range. When I try to send larger messages, the application crashes
> with a segmentation fault. I have tried to increase the size of the
> btl_tcp send and receive buffers, but it does not seem to be working.
>
> Are there any other settings I need to change to enable large messages
> to be sent?
> I am using the Boost serialization and Boost MPI libraries to simplify
> message packing and unpacking.
>
> mpirun -np 3 --mca btl_tcp_eager_limit 536870912 --mca
> btl_tcp_max_send_size 536870912 --mca
> btl_tcp_rdma_pipeline_send_length 524288 --mca btl_tcp_sndbuf 536870912
> --mca btl_tcp_rcvbuf 536870912 --hostfile hostfile2 --rankfile rankfile2
> ./boost_binomial_no_LB
>
> [rh5x64-u16:25446] *** Process received signal ***
> [rh5x64-u16:25446] Signal: Segmentation fault (11)
> [rh5x64-u16:25446] Signal code: Address not mapped (1)
> [rh5x64-u16:25446] Failing at address: 0x2b12d14aafdc
> [rh5x64-u16:25446] [ 0] /lib64/libpthread.so.0 [0x3ba680e7c0]
> [rh5x64-u16:25446] [ 1] /lib64/libc.so.6(memcpy+0xa0) [0x3ba5c7be50]
> [rh5x64-u16:25446] [ 2] /usr/local/lib/libmpi.so.0 [0x2b11ccbe0c02]
> [rh5x64-u16:25446] [ 3] /usr/local/lib/libmpi.so.0(ompi_convertor_pack+0x160) [0x2b11ccbe4930]
> [rh5x64-u16:25446] [ 4] /usr/local/lib/openmpi/mca_btl_tcp.so [0x2b11cffcaf67]
> [rh5x64-u16:25446] [ 5] /usr/local/lib/openmpi/mca_pml_ob1.so [0x2b11cf5af97a]
> [rh5x64-u16:25446] [ 6] /usr/local/lib/openmpi/mca_pml_ob1.so [0x2b11cf5a9b0d]
> [rh5x64-u16:25446] [ 7] /usr/local/lib/openmpi/mca_btl_tcp.so [0x2b11cffcd693]
> [rh5x64-u16:25446] [ 8] /usr/local/lib/libopen-pal.so.0 [0x2b11cd0ab95b]
> [rh5x64-u16:25446] [ 9] /usr/local/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2b11cd0a0b3e]
> [rh5x64-u16:25446] [10] /usr/local/lib/libmpi.so.0 [0x2b11ccbd62c9]
> [rh5x64-u16:25446] [11] /usr/local/lib/libmpi.so.0(PMPI_Test+0x73) [0x2b11ccbfc863]
> [rh5x64-u16:25446] [12] /usr/local/lib/libboost_mpi.so.1.42.0(_ZN5boost3mpi7request4testEv+0x13d) [0x2b11cc50451d]
> [rh5x64-u16:25446] [13] ./boost_binomial_no_LB(_ZN5boost3mpi8wait_allIPNS0_7requestEEEvT_S4_+0x19d) [0x42206d]
> [rh5x64-u16:25446] [14] ./boost_binomial_no_LB [0x41c82a]
> [rh5x64-u16:25446] [15] ./boost_binomial_no_LB(main+0x169) [0x41d4a9]
> [rh5x64-u16:25446] [16] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3ba5c1d994]
> [rh5x64-u16:25446] [17] ./boost_binomial_no_LB(__gxx_personality_v0+0x371) [0x41a799]
> [rh5x64-u16:25446] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 25446 on node 172.10.0.116
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------

--
Kind regards,
Timur Magomedov
Senior C++ Developer
DevelopOnBox LLC / Zodiac Interactive
http://www.zodiac.tv/
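One way to narrow a crash like this down (a sketch, not something from the original thread) is to take Boost.MPI out of the picture and check whether a plain MPI_Send/MPI_Recv of a buffer in the same size range also segfaults over the TCP BTL. If it does, the problem lies below the serialization layer, in the converter/BTL path visible in the backtrace above. A minimal stand-alone test in C, with the buffer size an arbitrary example value (about 256 MB of doubles), run with something like "mpirun -np 2" across the hosts in question:

/* Hypothetical reproducer sketch: rank 1 sends one large buffer to rank 0
 * using plain MPI calls, so Boost.MPI is not involved at all. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    const int n = 32 * 1024 * 1024;   /* example: 32M doubles, about 256 MB */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *buf = malloc(sizeof(double) * n);
    if (buf == NULL || size < 2) {
        fprintf(stderr, "rank %d: need 2 ranks and a successful malloc\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    if (rank == 0) {
        MPI_Recv(buf, n, MPI_DOUBLE, 1, 999, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d doubles\n", n);
    } else if (rank == 1) {
        MPI_Send(buf, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

If this plain-MPI version only crashes when the two peers differ in word size, that matches the heterogeneous bug referenced above.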
Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)
Hello,

You can get a nightly trunk snapshot from http://www.open-mpi.org/nightly/trunk/
You can grab openmpi-1.7a1r23032 and test it. That would be great.

On Mon, 26/04/2010 at 10:26 +0200, TRINH Minh Hieu wrote:
>
> Hello,
>
> I can help to test the patch if you need. But I don't know much about how
> to use svn to get the latest source to test.
> Regards
>
> TMHieu
>
> Message: 1
> Date: Fri, 23 Apr 2010 20:15:58 +0400
> From: Timur Magomedov <timur.magome...@developonbox.ru>
> Subject: Re: [OMPI users] Segmentation fault when Send/Recv on
>          heterogeneous cluster (32/64 bit machines)
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <1272039358.4818.137.camel@magomedov-desktop>
> Content-Type: text/plain; charset="UTF-8"
>
> Hello,
> It seems that this was really a bug. It was recently fixed in the
> repository:
> https://svn.open-mpi.org/trac/ompi/changeset/23030
> and will likely be fixed in the next 1.4 release.
>
> Here is the corresponding thread on ompi-devel:
> http://www.open-mpi.org/community/lists/devel/2010/04/7787.php
>
> --
> M. TRINH Minh Hieu
> CEA, IBEB, SBTN/LIRM,
> F-30207 Bagnols-sur-Cèze, FRANCE

--
Kind regards,
Timur Magomedov
Senior C++ Developer
DevelopOnBox LLC / Zodiac Interactive
http://www.zodiac.tv/
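For reference, testing a nightly snapshot boils down to building it into its own prefix and re-running the reproducer against that installation. A rough sketch, assuming the snapshot mentioned above; the exact tarball name, the install prefix, and any configure options (which should match your existing build) will differ:

$ wget http://www.open-mpi.org/nightly/trunk/openmpi-1.7a1r23032.tar.gz
$ tar xzf openmpi-1.7a1r23032.tar.gz
$ cd openmpi-1.7a1r23032
$ ./configure --prefix=$HOME/ompi-trunk
$ make -j4 && make install
$ $HOME/ompi-trunk/bin/mpicc hetero.c -o hetero
$ $HOME/ompi-trunk/bin/mpirun -np 2 ./hetero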
Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)
> >>>>> MPI_Send(d,n,MPI_DOUBLE,0,999,MPI_COMM_WORLD);
> >>>>> }
> >>>>> code
> >>>>>
> >>>>> I got a segmentation fault with n=1 but no error with n=1000
> >>>>> I have 2 machines :
> >>>>> sbtn155 : Intel Xeon, x86_64
> >>>>> sbtn211 : Intel Pentium 4, i686
> >>>>>
> >>>>> The code is compiled on the x86_64 and i686 machines, using OpenMPI 1.4.1,
> >>>>> installed in /tmp/openmpi :
> >>>>> [mhtrinh@sbtn211 heterogenous]$ make hetero
> >>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.i686.o
> >>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.i686.o -o hetero.i686 -lm
> >>>>>
> >>>>> [mhtrinh@sbtn155 heterogenous]$ make hetero
> >>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.x86_64.o
> >>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.x86_64.o -o hetero.x86_64 -lm
> >>>>>
> >>>>> I run the code using an appfile and got those errors:
> >>>>> $ cat appfile
> >>>>> --host sbtn155 -np 1 hetero.x86_64
> >>>>> --host sbtn155 -np 1 hetero.x86_64
> >>>>> --host sbtn211 -np 1 hetero.i686
> >>>>>
> >>>>> $ mpirun -hetero --app appfile
> >>>>> Input array length :
> >>>>> 1
> >>>>> Receiving from proc 1 : OK
> >>>>> Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
> >>>>> [sbtn155:26386] Signal: Segmentation fault (11)
> >>>>> [sbtn155:26386] Signal code: Address not mapped (1)
> >>>>> [sbtn155:26386] Failing at address: 0x200627bd8
> >>>>> [sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
> >>>>> [sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2d8d7908]
> >>>>> [sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so [0x2e2fc6e3]
> >>>>> [sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2afe39db]
> >>>>> [sbtn155:26386] [ 4] /tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2afd8b9e]
> >>>>> [sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2d8d4b25]
> >>>>> [sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b) [0x2ab30f9b]
> >>>>> [sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
> >>>>> [sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fa421e074]
> >>>>> [sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
> >>>>> [sbtn155:26386] *** End of error message ***
> >>>>> --------------------------------------------------------------------------
> >>>>> mpirun noticed that process rank 0 with PID 26386 on node sbtn155
> >>>>> exited on signal 11 (Segmentation fault).
> >>>>> --------------------------------------------------------------------------
> >>>>>
> >>>>> Am I missing an option in order to run on a heterogeneous cluster ?
> >>>>> Do MPI_Send/Recv have a limit on array size when using a heterogeneous cluster ?
> >>>>> Thanks for your help. Regards
> >>>>>
> >>>>> --
> >>>>> M. TRINH Minh Hieu
> >>>>> CEA, IBEB, SBTN/LIRM,
> >>>>> F-30207 Bagnols-sur-Cèze, FRANCE
> >>>>
> >>>> --
> >>>> Jeff Squyres
> >>>> jsquy...@cisco.com
> >>>> For corporate legal information go to:
> >>>> http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > --
> > M. TRINH Minh Hieu
> > CEA, IBEB, SBTN/LIRM,
> > F-30207 Bagnols-sur-Cèze, FRANCE

--
Kind regards,
Timur Magomedov
Senior C++ Developer
DevelopOnBox LLC / Zodiac Interactive
http://www.zodiac.tv/
Re: [OMPI users] Fwd: Open MPI v1.4 can't find default hostfile
Hello.

It looks like your hostfile path should be /usr/local/etc/openmpi-default-hostfile, not usr/local/etc/openmpi-default-hostfile, but somehow Open MPI gets the second path.

On Fri, 16/04/2010 at 19:10 +0200, Mario Ogrizek wrote:
> Well, I'm not sure why I should name it /openmpi-default-hostfile.
> Especially because mpirun v1.2 executes without any errors.
> But I made a copy named /openmpi-default-hostfile, and still the
> same result.
>
> This is the whole error message for a simple hello world program:
>
> Open RTE was unable to open the hostfile:
> usr/local/etc/openmpi-default-hostfile
> Check to make sure the path and filename are correct.
> --------------------------------------------------------------------------
> [Mario.local:04300] [[114,0],0] ORTE_ERROR_LOG: Not found in file
> base/ras_base_allocate.c at line 186
> [Mario.local:04300] [[114,0],0] ORTE_ERROR_LOG: Not found in file
> base/plm_base_launch_support.c at line 72
> [Mario.local:04300] [[114,0],0] ORTE_ERROR_LOG: Not found in file
> plm_rsh_module.c at line 990
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed
> shared libraries on the remote node. You may set your LD_LIBRARY_PATH
> to have the location of the shared libraries on the remote nodes and
> this will automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
> PS: PTP is the Parallel Tools Platform plugin for Eclipse.
>
> Regards,
>
> Mario

--
Kind regards,
Timur Magomedov
Senior C++ Developer
DevelopOnBox LLC / Zodiac Interactive
http://www.zodiac.tv/
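Since the path in the error is missing its leading slash, one way to sidestep the default-hostfile lookup entirely is to pass the hostfile explicitly on the command line. A minimal sketch; the host names, slot counts, and the ./hello program are placeholders, not taken from the original thread:

$ cat /usr/local/etc/openmpi-default-hostfile
node0 slots=2
node1 slots=2
$ mpirun -np 4 --hostfile /usr/local/etc/openmpi-default-hostfile ./hello

If that run works, the problem is confined to how the default hostfile path is being resolved rather than to the hostfile contents.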
Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)
Hello.

It looks like you allocate memory in every loop iteration on process #0 and don't free it, so malloc fails on some iteration.

On Sun, 28/02/2010 at 19:22 +0100, TRINH Minh Hieu wrote:
> Hello,
>
> I have some problems running MPI on my heterogeneous cluster. More
> precisely, I get a segmentation fault when sending a large array (about
> 1) of double from an i686 machine to an x86_64 machine. It does not
> happen with a small array. Here is the send/recv source code (the
> complete source is in the attached file):
> code
> if (me == 0 ) {
>     for (int pe=1; pe<nprocs; pe++)
>     {
>         printf("Receiving from proc %d : ",pe); fflush(stdout);
>         d=(double *)malloc(sizeof(double)*n);
>         MPI_Recv(d,n,MPI_DOUBLE,pe,999,MPI_COMM_WORLD,);
>         printf("OK\n"); fflush(stdout);
>     }
>     printf("All done.\n");
> }
> else {
>     d=(double *)malloc(sizeof(double)*n);
>     MPI_Send(d,n,MPI_DOUBLE,0,999,MPI_COMM_WORLD);
> }
> code
>
> I got a segmentation fault with n=1 but no error with n=1000
> I have 2 machines :
> sbtn155 : Intel Xeon, x86_64
> sbtn211 : Intel Pentium 4, i686
>
> The code is compiled on the x86_64 and i686 machines, using OpenMPI 1.4.1,
> installed in /tmp/openmpi :
> [mhtrinh@sbtn211 heterogenous]$ make hetero
> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.i686.o
> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.i686.o -o hetero.i686 -lm
>
> [mhtrinh@sbtn155 heterogenous]$ make hetero
> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.x86_64.o
> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.x86_64.o -o hetero.x86_64 -lm
>
> I run the code using an appfile and got those errors:
> $ cat appfile
> --host sbtn155 -np 1 hetero.x86_64
> --host sbtn155 -np 1 hetero.x86_64
> --host sbtn211 -np 1 hetero.i686
>
> $ mpirun -hetero --app appfile
> Input array length :
> 1
> Receiving from proc 1 : OK
> Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
> [sbtn155:26386] Signal: Segmentation fault (11)
> [sbtn155:26386] Signal code: Address not mapped (1)
> [sbtn155:26386] Failing at address: 0x200627bd8
> [sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
> [sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2d8d7908]
> [sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so [0x2e2fc6e3]
> [sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2afe39db]
> [sbtn155:26386] [ 4] /tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2afd8b9e]
> [sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2d8d4b25]
> [sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b) [0x2ab30f9b]
> [sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
> [sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fa421e074]
> [sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
> [sbtn155:26386] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 26386 on node sbtn155
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> Am I missing an option in order to run on a heterogeneous cluster ?
> Do MPI_Send/Recv have a limit on array size when using a heterogeneous cluster ?
> Thanks for your help. Regards

--
Kind regards,
Timur Magomedov
Senior C++ Developer
DevelopOnBox LLC / Zodiac Interactive
http://www.zodiac.tv/
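Following that observation, here is a sketch of just the quoted fragment with the leak addressed: each buffer is freed once it has been used, and MPI_STATUS_IGNORE stands in for the status argument that was garbled in the quoted code. This only fixes the leak; the 32/64-bit conversion bug discussed earlier in the thread is a separate issue.

/* Hypothetical fix for the leak: release each buffer after it is received or sent. */
if (me == 0) {
    for (int pe = 1; pe < nprocs; pe++) {
        printf("Receiving from proc %d : ", pe); fflush(stdout);
        d = (double *)malloc(sizeof(double) * n);
        MPI_Recv(d, n, MPI_DOUBLE, pe, 999, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("OK\n"); fflush(stdout);
        free(d);    /* free before the next iteration allocates again */
    }
    printf("All done.\n");
} else {
    d = (double *)malloc(sizeof(double) * n);
    MPI_Send(d, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
    free(d);
}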