Hi, I have made a new discovery about this problem:
It seems that the maximum array size that can be sent from a 32-bit to a 64-bit machine is proportional to the parameter "btl_tcp_eager_limit". When I set it to 200 000 000 (2e08 bytes, about 190 MB), I can send an array of up to 2e07 doubles (152 MB). I didn't find much information about btl_tcp_eager_limit other than in the output of "ompi_info --all". If I leave it at 2e08, will it impact the performance of Open MPI?
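For reference, the parameter can be set on the mpirun command line (or in an MCA parameter file), for example with the appfile from the quoted thread below (200000000 is just the value I experimented with, not a recommended setting):

$ mpirun --mca btl_tcp_eager_limit 200000000 -hetero --app appfile

The value currently in effect can be checked with:

$ ompi_info --param btl tcp | grep eager_limit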
It may also be worth noting that if the master (rank 0) is a 32-bit machine, I don't get a segfault: I can send big arrays with a small "btl_tcp_eager_limit" from a 64-bit machine to a 32-bit one.

Do I need to move this thread to the devel mailing list?

Regards,
TMHieu

On Tue, Mar 2, 2010 at 2:54 PM, TRINH Minh Hieu <mhtr...@gmail.com> wrote:
> Hello,
>
> Yes, I compiled OpenMPI with --enable-heterogeneous. More precisely, I
> compiled with:
> $ ./configure --prefix=/tmp/openmpi --enable-heterogeneous
>     --enable-cxx-exceptions --enable-shared
>     --enable-orterun-prefix-by-default
> $ make all install
>
> I attach the output of ompi_info for my 2 machines.
>
> TMHieu
>
> On Tue, Mar 2, 2010 at 1:57 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
>> Did you configure Open MPI with --enable-heterogeneous?
>>
>> On Feb 28, 2010, at 1:22 PM, TRINH Minh Hieu wrote:
>>
>>> Hello,
>>>
>>> I have some problems running MPI on my heterogeneous cluster. More
>>> precisely, I get a segmentation fault when sending a large array (about
>>> 10000 elements) of doubles from an i686 machine to an x86_64 machine.
>>> It does not happen with small arrays. Here is the send/recv code
>>> (the complete source is in the attached file):
>>> ======== code ================
>>> if (me == 0) {
>>>     for (int pe = 1; pe < nprocs; pe++) {
>>>         printf("Receiving from proc %d : ", pe); fflush(stdout);
>>>         d = (double *)malloc(sizeof(double) * n);
>>>         MPI_Recv(d, n, MPI_DOUBLE, pe, 999, MPI_COMM_WORLD, &status);
>>>         printf("OK\n"); fflush(stdout);
>>>     }
>>>     printf("All done.\n");
>>> }
>>> else {
>>>     d = (double *)malloc(sizeof(double) * n);
>>>     MPI_Send(d, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
>>> }
>>> ======== code ================
>>>
>>> I get a segmentation fault with n=10000 but no error with n=1000.
>>> I have 2 machines:
>>> sbtn155 : Intel Xeon, x86_64
>>> sbtn211 : Intel Pentium 4, i686
>>>
>>> The code is compiled on the x86_64 and the i686 machine, using OpenMPI 1.4.1,
>>> installed in /tmp/openmpi:
>>> [mhtrinh@sbtn211 heterogenous]$ make hetero
>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.i686.o
>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.i686.o -o hetero.i686 -lm
>>>
>>> [mhtrinh@sbtn155 heterogenous]$ make hetero
>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.x86_64.o
>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.x86_64.o -o hetero.x86_64 -lm
>>>
>>> I run the code using an appfile and get these errors:
>>> $ cat appfile
>>> --host sbtn155 -np 1 hetero.x86_64
>>> --host sbtn155 -np 1 hetero.x86_64
>>> --host sbtn211 -np 1 hetero.i686
>>>
>>> $ mpirun -hetero --app appfile
>>> Input array length :
>>> 10000
>>> Receiving from proc 1 : OK
>>> Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
>>> [sbtn155:26386] Signal: Segmentation fault (11)
>>> [sbtn155:26386] Signal code: Address not mapped (1)
>>> [sbtn155:26386] Failing at address: 0x200627bd8
>>> [sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
>>> [sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d7908]
>>> [sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so [0x2aaaae2fc6e3]
>>> [sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2aaaaafe39db]
>>> [sbtn155:26386] [ 4] /tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2aaaaafd8b9e]
>>> [sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d4b25]
>>> [sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b) [0x2aaaaab30f9b]
>>> [sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
>>> [sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fa421e074]
>>> [sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
>>> [sbtn155:26386] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 26386 on node sbtn155
>>> exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>>
>>> Am I missing an option needed to run on a heterogeneous cluster?
>>> Do MPI_Send/Recv have a limit on array size when using a heterogeneous cluster?
>>> Thanks for your help. Regards
>>>
>>> --
>>> ============================================
>>> M. TRINH Minh Hieu
>>> CEA, IBEB, SBTN/LIRM,
>>> F-30207 Bagnols-sur-Cèze, FRANCE
>>> ============================================
>>>
>>> <hetero.c.bz2>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/