Re: [OMPI users] Open MPI 1.4 tuning for sending large messages

2010-04-27 Thread Timur Magomedov
Hello,
Are you using a heterogeneous environment? There was a similar issue
recently with a segfault in a mixed x86 and x86_64 environment. Here is
the corresponding thread on ompi-devel:
http://www.open-mpi.org/community/lists/devel/2010/04/7787.php
This was fixed in trunk and will likely be included in the next 1.4 release.
You can download the latest trunk snapshot from here
http://www.open-mpi.org/nightly/trunk/
and test it.
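
To help isolate whether the crash comes from the Boost layer or from the
TCP transport itself, a minimal plain-MPI test along the following lines
might be worth running (this sketch is not from the original thread; the
file name, message tag and default size are illustrative):

/* large_send_test.c - send one large message from each non-zero rank to rank 0 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    /* default: 64M doubles = 512 MB, close to the failing message size */
    int n = (argc > 1) ? atoi(argv[1]) : 64 * 1024 * 1024;
    double *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    buf = malloc(sizeof(double) * (size_t)n);   /* contents are irrelevant here */
    if (buf == NULL) {
        fprintf(stderr, "rank %d: malloc of %d doubles failed\n", rank, n);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    if (rank == 0) {
        MPI_Status status;
        for (int src = 1; src < nprocs; src++) {
            MPI_Recv(buf, n, MPI_DOUBLE, src, 42, MPI_COMM_WORLD, &status);
            printf("received %d doubles from rank %d\n", n, src);
        }
    } else {
        MPI_Send(buf, n, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

If this plain-MPI test also crashes with the same MCA settings, the problem
is below Boost; if it runs cleanly, the Boost serialization/packing layer is
the place to look.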

On Mon, 26/04/2010 at 15:28 -0400, Pooja Varshneya wrote:
> Hi All,
> 
> I am using Open MPI 1.4 on a cluster of Intel quad-core processors
> running Linux and connected by Ethernet.
> 
> In an application, I am trying to send and receive large messages of
> sizes ranging from 1 KB up to 500 MB.
> The application works fine if the message sizes are within the 1 MB
> range. When I try to send larger messages, the application crashes
> with a segmentation fault. I have tried to increase the size of the
> btl_tcp send and receive buffers, but that does not seem to help.
> 
> Are there any other settings I need to change to enable large messages
> to be sent?
> I am using the Boost serialization and Boost MPI libraries to simplify
> message packing and unpacking.
> 
> mpirun -np 3 --mca btl_tcp_eager_limit 536870912 --mca btl_tcp_max_send_size 536870912 --mca btl_tcp_rdma_pipeline_send_length 524288 --mca btl_tcp_sndbuf 536870912 --mca btl_tcp_rcvbuf 536870912 --hostfile hostfile2 --rankfile rankfile2 ./boost_binomial_no_LB
> 
> 
> [rh5x64-u16:25446] *** Process received signal ***
> [rh5x64-u16:25446] Signal: Segmentation fault (11)
> [rh5x64-u16:25446] Signal code: Address not mapped (1)
> [rh5x64-u16:25446] Failing at address: 0x2b12d14aafdc
> [rh5x64-u16:25446] [ 0] /lib64/libpthread.so.0 [0x3ba680e7c0]
> [rh5x64-u16:25446] [ 1] /lib64/libc.so.6(memcpy+0xa0) [0x3ba5c7be50]
> [rh5x64-u16:25446] [ 2] /usr/local/lib/libmpi.so.0 [0x2b11ccbe0c02]
> [rh5x64-u16:25446] [ 3] /usr/local/lib/libmpi.so.0(ompi_convertor_pack+0x160) [0x2b11ccbe4930]
> [rh5x64-u16:25446] [ 4] /usr/local/lib/openmpi/mca_btl_tcp.so [0x2b11cffcaf67]
> [rh5x64-u16:25446] [ 5] /usr/local/lib/openmpi/mca_pml_ob1.so [0x2b11cf5af97a]
> [rh5x64-u16:25446] [ 6] /usr/local/lib/openmpi/mca_pml_ob1.so [0x2b11cf5a9b0d]
> [rh5x64-u16:25446] [ 7] /usr/local/lib/openmpi/mca_btl_tcp.so [0x2b11cffcd693]
> [rh5x64-u16:25446] [ 8] /usr/local/lib/libopen-pal.so.0 [0x2b11cd0ab95b]
> [rh5x64-u16:25446] [ 9] /usr/local/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2b11cd0a0b3e]
> [rh5x64-u16:25446] [10] /usr/local/lib/libmpi.so.0 [0x2b11ccbd62c9]
> [rh5x64-u16:25446] [11] /usr/local/lib/libmpi.so.0(PMPI_Test+0x73) [0x2b11ccbfc863]
> [rh5x64-u16:25446] [12] /usr/local/lib/libboost_mpi.so.1.42.0(_ZN5boost3mpi7request4testEv+0x13d) [0x2b11cc50451d]
> [rh5x64-u16:25446] [13] ./boost_binomial_no_LB(_ZN5boost3mpi8wait_allIPNS0_7requestEEEvT_S4_+0x19d) [0x42206d]
> [rh5x64-u16:25446] [14] ./boost_binomial_no_LB [0x41c82a]
> [rh5x64-u16:25446] [15] ./boost_binomial_no_LB(main+0x169) [0x41d4a9]
> [rh5x64-u16:25446] [16] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3ba5c1d994]
> [rh5x64-u16:25446] [17] ./boost_binomial_no_LB(__gxx_personality_v0+0x371) [0x41a799]
> [rh5x64-u16:25446] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 25446 on node 172.10.0.116
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Kind regards,
Timur Magomedov
Senior C++ Developer
DevelopOnBox LLC / Zodiac Interactive
http://www.zodiac.tv/



Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)

2010-04-26 Thread Timur Magomedov
Hello,
You can get a nightly trunk snapshot from here:
http://www.open-mpi.org/nightly/trunk/
Grabbing openmpi-1.7a1r23032 and testing it would be great.

On Mon, 26/04/2010 at 10:26 +0200, TRINH Minh Hieu wrote:
> 
> Hello,
> 
> I can help test the patch if you need me to, but I don't know much
> about using svn to get the latest source to test.
> Regards
> 
>TMHieu
> 
> 
> Message: 1
> Date: Fri, 23 Apr 2010 20:15:58 +0400
> From: Timur Magomedov <timur.magome...@developonbox.ru>
> Subject: Re: [OMPI users] Segmentation fault when Send/Recv on
> heterogeneous cluster (32/64 bit machines)
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <1272039358.4818.137.camel@magomedov-desktop>
> Content-Type: text/plain; charset="UTF-8"
> 
> Hello,
> It seems that this really was a bug. It was recently fixed in the
> repository:
> https://svn.open-mpi.org/trac/ompi/changeset/23030
> and will likely be included in the next 1.4 release.
> 
> Here is corresponding thread in ompi-devel:
> http://www.open-mpi.org/community/lists/devel/2010/04/7787.php
> 
>  
> -- 
> 
>   M. TRINH Minh Hieu
>   CEA, IBEB, SBTN/LIRM,
>   F-30207 Bagnols-sur-Cèze, FRANCE
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Kind regards,
Timur Magomedov
Senior C++ Developer
DevelopOnBox LLC / Zodiac Interactive
http://www.zodiac.tv/



Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)

2010-04-23 Thread Timur Magomedov
> >>>>> MPI_Send(d,n,MPI_DOUBLE,0,999,MPI_COMM_WORLD);
> >>>>> }
> >>>>>  code 
> >>>>>
> >>>>> I got segmentation fault with n=1 but no error with n=1000
> >>>>> I have 2 machines :
> >>>>> sbtn155 : Intel Xeon, x86_64
> >>>>> sbtn211 : Intel Pentium 4, i686
> >>>>>
> >>>>> The code is compiled on both the x86_64 and the i686 machine, using
> >>>>> Open MPI 1.4.1, installed in /tmp/openmpi :
> >>>>> [mhtrinh@sbtn211 heterogenous]$ make hetero
> >>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.i686.o
> >>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.i686.o -o hetero.i686 -lm
> >>>>>
> >>>>> [mhtrinh@sbtn155 heterogenous]$ make hetero
> >>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.x86_64.o
> >>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.x86_64.o -o hetero.x86_64 -lm
> >>>>>
> >>>>> I ran the code using an appfile and got these errors :
> >>>>> $ cat appfile
> >>>>> --host sbtn155 -np 1 hetero.x86_64
> >>>>> --host sbtn155 -np 1 hetero.x86_64
> >>>>> --host sbtn211 -np 1 hetero.i686
> >>>>>
> >>>>> $ mpirun -hetero --app appfile
> >>>>> Input array length :
> >>>>> 1
> >>>>> Receiving from proc 1 : OK
> >>>>> Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
> >>>>> [sbtn155:26386] Signal: Segmentation fault (11)
> >>>>> [sbtn155:26386] Signal code: Address not mapped (1)
> >>>>> [sbtn155:26386] Failing at address: 0x200627bd8
> >>>>> [sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
> >>>>> [sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2d8d7908]
> >>>>> [sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so [0x2e2fc6e3]
> >>>>> [sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2afe39db]
> >>>>> [sbtn155:26386] [ 4] /tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2afd8b9e]
> >>>>> [sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2d8d4b25]
> >>>>> [sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b) [0x2ab30f9b]
> >>>>> [sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
> >>>>> [sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fa421e074]
> >>>>> [sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
> >>>>> [sbtn155:26386] *** End of error message ***
> >>>>>
> >>>>> --------------------------------------------------------------------------
> >>>>> mpirun noticed that process rank 0 with PID 26386 on node sbtn155
> >>>>> exited on signal 11 (Segmentation fault).
> >>>>> --------------------------------------------------------------------------
> >>>>>
> >>>>> Am I missing an option needed to run on a heterogeneous cluster ?
> >>>>> Do MPI_Send/Recv have a limited array size when used on a
> >>>>> heterogeneous cluster ?
> >>>>> Thanks for your help. Regards
> >>>>>
> >>>>> --
> >>>>> 
> >>>>>M. TRINH Minh Hieu
> >>>>>CEA, IBEB, SBTN/LIRM,
> >>>>>F-30207 Bagnols-sur-Cèze, FRANCE
> >>>>> 
> >>>>>
> >>>>> _______________________________________________
> >>>>> users mailing list
> >>>>> us...@open-mpi.org
> >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>
> >>>>
> >>>> --
> >>>> Jeff Squyres
> >>>> jsquy...@cisco.com
> >>>> For corporate legal information go to:
> >>>> http://www.cisco.com/web/about/doing_business/legal/cri/
> >>>>
> 
> 
> 
> -- 
> 
>   M. TRINH Minh Hieu
>   CEA, IBEB, SBTN/LIRM,
>   F-30207 Bagnols-sur-Cèze, FRANCE
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Kind regards,
Timur Magomedov
Senior C++ Developer
DevelopOnBox LLC / Zodiac Interactive
http://www.zodiac.tv/



Re: [OMPI users] Fwd: Open MPI v1.4 can't find default hostfile

2010-04-16 Thread Timur Magomedov
Hello.
It looks like your hostfile path should
be /usr/local/etc/openmpi-default-hostfile, not
usr/local/etc/openmpi-default-hostfile, but somehow Open MPI ends up with
the second (relative) path.
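
For reference only (a generic illustration, not Mario's actual setup; the
hostnames, slot counts and program name below are made up), a default
hostfile is a plain text file listing one node per line, and an explicit
hostfile can always be passed to mpirun with an absolute path:

# /usr/local/etc/openmpi-default-hostfile
node01 slots=4
node02 slots=4

mpirun --hostfile /usr/local/etc/openmpi-default-hostfile -np 4 ./hello_world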

On Fri, 16/04/2010 at 19:10 +0200, Mario Ogrizek wrote:
> Well, I'm not sure why I should name it /openmpi-default-hostfile,
> especially because mpirun v1.2 executes without any errors.
> But I made a copy named /openmpi-default-hostfile, and still get the
> same result.
> 
> This is the whole error message for a simple hello world program: 
> 
> 
> Open RTE was unable to open the hostfile:
> usr/local/etc/openmpi-default-hostfile
> Check to make sure the path and filename are correct.
> --------------------------------------------------------------------------
> [Mario.local:04300] [[114,0],0] ORTE_ERROR_LOG: Not found in file
> base/ras_base_allocate.c at line 186
> [Mario.local:04300] [[114,0],0] ORTE_ERROR_LOG: Not found in file
> base/plm_base_launch_support.c at line 72
> [Mario.local:04300] [[114,0],0] ORTE_ERROR_LOG: Not found in file
> plm_rsh_module.c at line 990
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
> 
> 
> 
> 
> PS: PTP is the Parallel Tools Platform plugin for Eclipse.
> 
> 
> Regards,
> 
> 
> Mario
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Kind regards,
Timur Magomedov
Senior C++ Developer
DevelopOnBox LLC / Zodiac Interactive
http://www.zodiac.tv/



Re: [OMPI users] Segmentation fault when Send/Recv on heterogeneous cluster (32/64 bit machines)

2010-03-01 Thread Timur Magomedov
Hello.
It looks like you allocate memory in every loop iteration on process #0
and don't free it, so malloc fails on some iteration.
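
As a rough sketch of that suggestion (illustrative only, not the code from
the attached file; it keeps the variable names of the snippet quoted below),
the buffer can be allocated once per process and freed afterwards:

/* hetero_fixed.c - same send/recv pattern, one allocation per process */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int me, nprocs;
    int n = (argc > 1) ? atoi(argv[1]) : 1000;   /* array length */
    double *d;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    d = malloc(sizeof(double) * (size_t)n);      /* allocated once, reused below */
    if (d == NULL) {
        fprintf(stderr, "rank %d: malloc failed for n=%d\n", me, n);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    if (me == 0) {
        MPI_Status status;
        for (int pe = 1; pe < nprocs; pe++) {
            printf("Receiving from proc %d : ", pe); fflush(stdout);
            MPI_Recv(d, n, MPI_DOUBLE, pe, 999, MPI_COMM_WORLD, &status);
            printf("OK\n"); fflush(stdout);
        }
        printf("All done.\n");
    } else {
        MPI_Send(d, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
    }

    free(d);                                     /* matching free */
    MPI_Finalize();
    return 0;
}

Whether or not the leak is what triggers the crash here, reusing one buffer
keeps rank 0's memory use flat no matter how many senders there are.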

On Sun, 28/02/2010 at 19:22 +0100, TRINH Minh Hieu wrote:
> Hello,
> 
> I have some problems running MPI on my heterogeneous cluster. More
> precisely, I got a segmentation fault when sending a large array (about
> 1) of doubles from an i686 machine to an x86_64 machine. It does not
> happen with a small array. Here is the send/recv source code (the complete
> source is in the attached file) :
> code 
> if (me == 0 ) {
>   for (int pe=1; pe<nprocs; pe++)
>   {
>   printf("Receiving from proc %d : ",pe); fflush(stdout);
>   d=(double *)malloc(sizeof(double)*n);
>   MPI_Recv(d,n,MPI_DOUBLE,pe,999,MPI_COMM_WORLD,&status);
>   printf("OK\n"); fflush(stdout);
>   }
>   printf("All done.\n");
> }
> else {
>   d=(double *)malloc(sizeof(double)*n);
>   MPI_Send(d,n,MPI_DOUBLE,0,999,MPI_COMM_WORLD);
> }
>  code 
> 
> I got segmentation fault with n=1 but no error with n=1000
> I have 2 machines :
> sbtn155 : Intel Xeon, x86_64
> sbtn211 : Intel Pentium 4, i686
> 
> The code is compiled on both the x86_64 and the i686 machine, using
> Open MPI 1.4.1, installed in /tmp/openmpi :
> [mhtrinh@sbtn211 heterogenous]$ make hetero
> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.i686.o
> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.i686.o -o hetero.i686 -lm
> 
> [mhtrinh@sbtn155 heterogenous]$ make hetero
> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.x86_64.o
> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.x86_64.o -o hetero.x86_64 -lm
> 
> I ran the code using an appfile and got these errors :
> $ cat appfile
> --host sbtn155 -np 1 hetero.x86_64
> --host sbtn155 -np 1 hetero.x86_64
> --host sbtn211 -np 1 hetero.i686
> 
> $ mpirun -hetero --app appfile
> Input array length :
> 1
> Receiving from proc 1 : OK
> Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
> [sbtn155:26386] Signal: Segmentation fault (11)
> [sbtn155:26386] Signal code: Address not mapped (1)
> [sbtn155:26386] Failing at address: 0x200627bd8
> [sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
> [sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2d8d7908]
> [sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so [0x2e2fc6e3]
> [sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2afe39db]
> [sbtn155:26386] [ 4] /tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2afd8b9e]
> [sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2d8d4b25]
> [sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b) [0x2ab30f9b]
> [sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
> [sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fa421e074]
> [sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
> [sbtn155:26386] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 26386 on node sbtn155
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> 
> Am I missing an option needed to run on a heterogeneous cluster ?
> Do MPI_Send/Recv have a limited array size when used on a heterogeneous
> cluster ?
> Thanks for your help. Regards
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Kind regards,
Timur Magomedov
Senior C++ Developer
DevelopOnBox LLC / Zodiac Interactive
http://www.zodiac.tv/