Since we don't have an obvious answer for this, I have filed ticket 2336 to
track this issue:

    https://svn.open-mpi.org/trac/ompi/ticket/2336


On Mar 8, 2010, at 6:56 AM, TRINH Minh Hieu wrote:

> 
> Hello,
> 
> I changed the test code (hetero.c, attached) so that the master (where data 
> is centralized) can be rank 1 or 2. 
> I tested with the master as rank 1 or rank 2: same problem. When the master is a 
> 64-bit machine, it segfaults as soon as it receives data from a 32-bit machine; 
> there is no problem with a 32-bit master. So it does not seem to be rank 
> dependent.
> 
> Regards,
> 
> 
> On Mon, Mar 8, 2010 at 1:27 PM, Terry Dontje <terry.don...@oracle.com> wrote:
> We (Oracle) have not done much extensive limits testing between 32-bit 
> and 64-bit applications.  Most of the testing we've done is more around 
> endianness (SPARC vs. x86_64).
> 
> The report below is interesting, though.  It sounds like the eager limit isn't 
> being normalized on the 64-bit machines.  A 32-bit rank 0 avoiding the 
> problem is also interesting; I wonder whether that is more due to which 
> rank is sending and which is receiving?
> 
> --td
> 
> 
> 
> Message: 3
> Date: Sun, 7 Mar 2010 05:34:21 -0600
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> Subject: Re: [OMPI users] Segmentation fault when Send/Recv
>        on heterogeneous cluster (32/64 bit machines)
> To: <us...@open-mpi.org>
> 
> IBM and Sun (Oracle) have probably done the most heterogeneous testing, but 
> it's probably not as stable as our homogeneous code paths.
> 
> Terry/Brad - do you have any insight here?
> 
> Yes, setting the eager limit high can impact performance. It's the amount of data 
> that OMPI will send eagerly without waiting for an ack from the receiver. 
> There are several secondary performance effects that can occur if you are 
> using sockets for transport and/or your program is only loosely synchronized. 
> If your program is tightly synchronous, it may not have too huge an overall 
> performance impact. 
> -jms
> Sent from my PDA.  No type good.
> 
> ----- Original Message -----
> From: users-boun...@open-mpi.org <users-boun...@open-mpi.org>
> To: Open MPI Users <us...@open-mpi.org>
> Sent: Thu Mar 04 09:02:19 2010
> Subject: Re: [OMPI users] Segmentation fault when Send/Recv 
> on heterogeneous cluster (32/64 bit machines)
> 
> Hi,
> 
> I have made a new discovery about this problem:
> 
> It seems that the array size that can be sent from a 32-bit to a 64-bit machine
> is proportional to the parameter "btl_tcp_eager_limit".
> When I set it to 200 000 000 (2e08 bytes, about 190 MB), I can send an
> array of up to 2e07 doubles (2e07 x 8 bytes = 1.6e08 bytes, about 152 MB),
> which stays under that limit.
> 
> I didn't find much information about btl_tcp_eager_limit other than
> in the "ompi_info --all" output. If I leave it at 2e08, will it impact
> the performance of Open MPI?
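> 
> For reference, the parameter can be inspected with ompi_info and overridden
> per run on the mpirun command line, e.g. (using the appfile from the original
> report below; the 200000000 value is just the one being tested here):
> 
> $ ompi_info --param btl tcp
> $ mpirun --mca btl_tcp_eager_limit 200000000 -hetero --app appfile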
> 
> It may also be noteworthy that if the master (rank 0) is a 32-bit
> machine, I don't get a segfault: I can send a big array with a small
> "btl_tcp_eager_limit" from a 64-bit machine to a 32-bit one.
> 
> Should I move this thread to the devel mailing list?
> 
> Regards,
> 
>   TMHieu
> 
> On Tue, Mar 2, 2010 at 2:54 PM, TRINH Minh Hieu <mhtr...@gmail.com> wrote:
>  
> Hello,
> 
> Yes, I compiled Open MPI with --enable-heterogeneous. More precisely, I
> configured and built with:
> $ ./configure --prefix=/tmp/openmpi --enable-heterogeneous
> --enable-cxx-exceptions --enable-shared
> --enable-orterun-prefix-by-default
> $ make all install
> 
> I attach the output of ompi_info from my 2 machines.
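> 
> As a quick check of that output: a build configured with --enable-heterogeneous
> should report heterogeneous support as enabled. Assuming the usual ompi_info
> summary format, this line can be pulled out directly with:
> 
> $ ompi_info | grep -i hetero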
> 
>    TMHieu
> 
> On Tue, Mar 2, 2010 at 1:57 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
>    
> Did you configure Open MPI with --enable-heterogeneous?
> 
> On Feb 28, 2010, at 1:22 PM, TRINH Minh Hieu wrote:
> 
>      
> Hello,
> 
> I have some problems running MPI on my heterogeneous cluster. More
> precisely, I get a segmentation fault when sending a large array (about
> 10000 doubles) from an i686 machine to an x86_64 machine. It does not
> happen with small arrays. Here is the send/recv source code (the complete
> source is in the attached file):
> ======== code ================
>     if (me == 0) {
>         for (int pe = 1; pe < nprocs; pe++)
>         {
>             printf("Receiving from proc %d : ", pe); fflush(stdout);
>             d = (double *)malloc(sizeof(double)*n);
>             MPI_Recv(d, n, MPI_DOUBLE, pe, 999, MPI_COMM_WORLD, &status);
>             printf("OK\n"); fflush(stdout);
>         }
>         printf("All done.\n");
>     }
>     else {
>         d = (double *)malloc(sizeof(double)*n);
>         MPI_Send(d, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
>     }
> ======== code ================
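> 
> For completeness, a minimal self-contained version of this test might look
> roughly like the sketch below. It assumes, as the program output suggests,
> that rank 0 reads n from stdin and broadcasts it to the other ranks; the
> actual attached hetero.c may differ (for example, it allocates the buffer
> inside the receive loop above, which this sketch does not).
> 
> ======== sketch ================
> #include <stdio.h>
> #include <stdlib.h>
> #include <mpi.h>
> 
> int main(int argc, char *argv[])
> {
>     int me, nprocs, n = 0;
>     double *d;
>     MPI_Status status;
> 
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &me);
>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
> 
>     /* Rank 0 reads the array length; every rank then learns it via a
>        broadcast (an assumption made for this sketch). */
>     if (me == 0) {
>         printf("Input array length :\n"); fflush(stdout);
>         if (scanf("%d", &n) != 1) n = 0;
>     }
>     MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
> 
>     /* One buffer per rank; rank 0 reuses it for every receive. */
>     d = (double *)malloc(sizeof(double) * n);
> 
>     if (me == 0) {
>         for (int pe = 1; pe < nprocs; pe++) {
>             printf("Receiving from proc %d : ", pe); fflush(stdout);
>             MPI_Recv(d, n, MPI_DOUBLE, pe, 999, MPI_COMM_WORLD, &status);
>             printf("OK\n"); fflush(stdout);
>         }
>         printf("All done.\n");
>     } else {
>         /* Fill the buffer so the sender transmits defined values. */
>         for (int i = 0; i < n; i++) d[i] = (double)i;
>         MPI_Send(d, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
>     }
> 
>     free(d);
>     MPI_Finalize();
>     return 0;
> }
> ======== end sketch ================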
> 
> I get a segmentation fault with n=10000 but no error with n=1000.
> I have 2 machines:
> sbtn155 : Intel Xeon,      x86_64
> sbtn211 : Intel Pentium 4, i686
> 
> The code is compiled on the x86_64 and i686 machines, using Open MPI 1.4.1,
> installed in /tmp/openmpi:
> [mhtrinh@sbtn211 heterogenous]$ make hetero
> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.i686.o
> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include
> hetero.i686.o -o hetero.i686 -lm
> 
> [mhtrinh@sbtn155 heterogenous]$ make hetero
> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o 
> hetero.x86_64.o
> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include
> hetero.x86_64.o -o hetero.x86_64 -lm
> 
> I run the code using an appfile and get these errors:
> $ cat appfile
> --host sbtn155 -np 1 hetero.x86_64
> --host sbtn155 -np 1 hetero.x86_64
> --host sbtn211 -np 1 hetero.i686
> 
> $ mpirun -hetero --app appfile
> Input array length :
> 10000
> Receiving from proc 1 : OK
> Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
> [sbtn155:26386] Signal: Segmentation fault (11)
> [sbtn155:26386] Signal code: Address not mapped (1)
> [sbtn155:26386] Failing at address: 0x200627bd8
> [sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
> [sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d7908]
> [sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so [0x2aaaae2fc6e3]
> [sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2aaaaafe39db]
> [sbtn155:26386] [ 4]
> /tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2aaaaafd8b9e]
> [sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d4b25]
> [sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b)
> [0x2aaaaab30f9b]
> [sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
> [sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fa421e074]
> [sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
> [sbtn155:26386] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 26386 on node sbtn155
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> 
> Am I missing an option needed to run on a heterogeneous cluster?
> Do MPI_Send/Recv have a limit on array size when used on a heterogeneous cluster?
> Thanks for your help. Regards,
> 
> --
> ============================================
>    M. TRINH Minh Hieu
>    CEA, IBEB, SBTN/LIRM,
>    F-30207 Bagnols-sur-Cèze, FRANCE
> ============================================
> 
> <hetero.c.bz2>
>        
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> 
> 
> 
> -- 
> ============================================
>   M. TRINH Minh Hieu
>   CEA, IBEB, SBTN/LIRM,
>   F-30207 Bagnols-sur-Cèze, FRANCE
> ============================================
> <hetero.c>


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

