Re: [OMPI users] MPI_Allgather with derived type crash
George --

When I run 10 copies on the same node with btl tcp,self (no sm or openib), valgrind reports the following to me (using ompi-1.4 branch HEAD):

==23753== Invalid write of size 1
==23753==    at 0x4C6EA31: non_overlap_copy_content_same_ddt (dt_copy.h:170)
==23753==    by 0x4C6CC3B: ompi_ddt_copy_content_same_ddt (dt_copy.c:95)
==23753==    by 0xE873C82: ompi_coll_tuned_allgather_intra_bruck (coll_tuned_allgather.c:186)
==23753==    by 0xE86B9FC: ompi_coll_tuned_allgather_intra_dec_fixed (coll_tuned_decision_fixed.c:561)
==23753==    by 0x4C7D104: PMPI_Allgather (pallgather.c:114)
==23753==    by 0x400DC0: main (andrew.c:58)
==23753==  Address 0xeef4832 is 6 bytes after a block of size 124 alloc'd
==23753==    at 0x4A05793: calloc (vg_replace_malloc.c:467)
==23753==    by 0x51FBCAB: opal_calloc (malloc.c:131)
==23753==    by 0xE873C22: ompi_coll_tuned_allgather_intra_bruck (coll_tuned_allgather.c:177)
==23753==    by 0xE86B9FC: ompi_coll_tuned_allgather_intra_dec_fixed (coll_tuned_decision_fixed.c:561)
==23753==    by 0x4C7D104: PMPI_Allgather (pallgather.c:114)
==23753==    by 0x400DC0: main (andrew.c:58)

I get lots of warnings like these (maybe 2 per process?). In valgrind, it eventually completes, but without valgrind it definitely crashes.

I've filed #2805 with this issue:

    https://svn.open-mpi.org/trac/ompi/ticket/2805

On May 25, 2011, at 7:16 AM, Andrew Senin wrote:

> Hello list,
>
> I have an application which uses MPI_Allgather with derived types. It works
> correctly with mpich2 and mvapich2. However it crashes periodically with
> openmpi2. After investigation I found that the crash takes place when I use
> derived datatypes with MPI_Allgather and the number of ranks is greater
> than 8. I've written a simple application which demonstrates the crash. It
> simply calls MPI_Allgather with a derived datatype that consists of 1
> shifted integer. The sample works correctly with 2-8 ranks. But when the
> number of ranks is greater than 8 it crashes with a segmentation fault
> inside the MPI_Type_free, MPI_Allgather or MPI_Type_create_struct
> functions. This sample also works correctly with mv2 and mpich2 with any
> number of ranks. Is this a limitation of ompi allgather?
>
> Crashed output:
> Press any key...
> Press any key...
> Press any key...
> Press any key...
> Press any key...
> Press any key...
> Press any key...
> Press any key...
> Press any key...
> [amd1:24260] *** Process received signal ***
> [amd1:24260] Signal: Segmentation fault (11)
> [amd1:24260] Signal code: Address not mapped (1)
> [amd1:24260] Failing at address: 0x18
> [amd1:24262] *** Process received signal ***
> [amd1:24262] Signal: Segmentation fault (11)
> [amd1:24262] Signal code: Address not mapped (1)
> [amd1:24262] Failing at address: 0x18
> [amd1:24258] *** Process received signal ***
> [amd1:24258] Signal: Segmentation fault (11)
> [amd1:24258] Signal code: Address not mapped (1)
> [amd1:24258] Failing at address: 0x18
> [amd1:24260] [ 0] /lib64/libpthread.so.0 [0x3d6b20eb10]
> [amd1:24260] [ 1] /lib64/libc.so.6 [0x3d6a671d80]
> [amd1:24260] [ 2] /lib64/libc.so.6(cfree+0x4b) [0x3d6a67276b]
> [amd1:24260] [ 3] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libopen-pal.so.0(opal_free+0x4e) [0x2ae52f5836bd]
> [amd1:24260] [ 4] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libmpi.so.1 [0x2ae52efd05aa]
> [amd1:24260] [ 5] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libmpi.so.1 [0x2ae52efd1e20]
> [amd1:24260] [ 6] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libmpi.so.1(ompi_ddt_destroy+0xe3) [0x2ae52efd1d7b]
> [amd1:24260] [ 7] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libmpi.so.1(MPI_Type_free+0xf0) [0x2ae52f0202ec]
> [amd1:24260] [ 8] ./gather_openmpi_153(main+0xef) [0x400dc8]
> [amd1:24260] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3d6a61d994]
> [amd1:24260] [10] ./gather_openmpi_153 [0x400ba9]
> [amd1:24260] *** End of error message ***
> [amd1:24262] [ 0] /lib64/libpthread.so.0 [0x3d6b20eb10]
> [amd1:24262] [ 1] /lib64/libc.so.6 [0x3d6a671d80]
> [amd1:24262] [ 2] /lib64/libc.so.6(cfree+0x4b) [0x3d6a67276b]
> [amd1:24262] [ 3] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libopen-pal.so.0(opal_free+0x4e) [0x2aedeea596bd]
> [amd1:24262] [ 4] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libmpi.so.1 [0x2aedee4a65aa]
> [amd1:24262] [ 5] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libmpi.so.1 [0x2aedee4a7e20]
> [amd1:24262] [ 6] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libmpi.so.1(ompi_ddt_destroy+0xe3) [0x2aedee4a7d7b]
> [amd1:24262] [ 7] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libmpi.so.1(MPI_Type_free+0xf0) [0x2aedee4f62ec]
> [amd1:24262] [ 8] ./gather_openmpi_153(main+0xef) [0x400dc8]
> [amd1:24262] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3d6a61d994]
> [amd1:24262] [10]
Re: [OMPI users] MPI_Allgather with derived type crash
I've tried on my home Ubuntu 10.04, 64 bit version. It crashes with 5-7 ranks and with 9 or more. I simply downloaded the 1.4.3 version (http://www.open-mpi.org/software/ompi/v1.4/downloads/openmpi-1.4.3.tar.gz):

- configure --prefix=`pwd`/install && make install
- cd ~/projects/gather
- ~/projects/distribs/openmpi-1.4.3/install/bin/mpicc -o gather ./gather.c
- ~/projects/distribs/openmpi-1.4.3/install/bin/mpirun -n 9 ./gather
- crash!

-Andrew

On Wed, May 25, 2011 at 10:48 PM, Andrew Senin wrote:

> Not exactly. I have 16 core nodes. Even if I run all 9 ranks on the same
> node it fails (with --mca btl sm,self). I also tried running on different
> nodes (3 nodes, 3 ranks on each node) with openib and tcp - the same
> effect. Also as I wrote in another message I could see this effect on vbox
> with CentOS 5.3 (1 core on guest, 4 cores on host, no network). So possibly
> this is something OS specific? Will try on Ubuntu and share the results.
>
> Regards,
> Andrew
Re: [OMPI users] MPI_Allgather with derived type crash
Andrew,

I have 8 octo-core nodes running under Caos NSA release 1.0.29 (Cato) 2009.11.13, connected with IB. I ran your test with one process per core, with different distributions, and all gave the same result.

  george.

On May 25, 2011, at 14:35 , Andrew Senin wrote:

> Hi George,
>
> Thanks a lot for your attempt! Possibly this is something OS specific? I'm
> using CentOS release 5.4 x86_64 on the cluster. I also tried it on my
> virtual box with CentOS 5.3 x86_64 (ompi 1.4.3). The same effect. On what OS
> did you try? If it helps I can upload the virtual box image on my web site.
>
> -Andrew
>
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of George Bosilca
>> Sent: Wednesday, May 25, 2011 8:47 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] MPI_Allgather with derived type crash
>>
>> Andrew,
>>
>> I tried with a freshly installed 1.4.3 but I can't reproduce your
>> issue. I tried with the 1.5 and the trunk and all complete your code
>> without errors. Not even valgrind found anything to complain about ...
>>
>> george.
>>
>> On May 25, 2011, at 08:22 , Andrew Senin wrote:
>>
>>> Sorry. I'm using OpenMPI 1.4.3.
>>>
>>> Thanks,
>>> -Andrew
>>>
>>>> -----Original Message-----
>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Peter Kjellstrom
>>>> Sent: Wednesday, May 25, 2011 4:19 PM
>>>> To: us...@open-mpi.org
>>>> Subject: Re: [OMPI users] MPI_Allgather with derived type crash
>>>>
>>>> On Wednesday, May 25, 2011 01:16:04 PM Andrew Senin wrote:
>>>>> Hello list,
>>>>>
>>>>> I have an application which uses MPI_Allgather with derived types. It
>>>>> works correctly with mpich2 and mvapich2. However it crashes
>>>>> periodically with openmpi2.
>>>>
>>>> Which version of OpenMPI are you using? There is no such thing as
>>>> openmpi2...
>>>>
>>>> /Peter

George Bosilca
Research Assistant Professor
Innovative Computing Laboratory
Department of Electrical Engineering and Computer Science
University of Tennessee, Knoxville
http://web.eecs.utk.edu/~bosilca/
Re: [OMPI users] MPI_Allgather with derived type crash
Not exactly. I have 16 core nodes. Even if I run all 9 ranks on the same node it fails (with --mca btl sm,self). I also tried running on different nodes (3 nodes, 3 ranks on each node) with openib and tcp - the same effect. Also as I wrote in another message I could see this effect on vbox with CentOS 5.3 (1 core on guest, 4 cores on host, no network). So possibly this is something OS specific? Will try on Ubuntu and share the results.

Regards,
Andrew

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Peter Kjellstrom
> Sent: Wednesday, May 25, 2011 9:03 PM
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] MPI_Allgather with derived type crash
>
> Would 8 happen to be the number of cores you have per node so what
> we're seeing is: single node OK, multi node FAIL?
>
> If so what kind of inter node network are you (trying to) use(ing)?
>
> /Peter
Re: [OMPI users] MPI_Allgather with derived type crash
Hi George,

Thanks a lot for your attempt! Possibly this is something OS specific? I'm using CentOS release 5.4 x86_64 on the cluster. I also tried it on my virtual box with CentOS 5.3 x86_64 (ompi 1.4.3). The same effect. On what OS did you try? If it helps I can upload the virtual box image on my web site.

-Andrew

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of George Bosilca
> Sent: Wednesday, May 25, 2011 8:47 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] MPI_Allgather with derived type crash
>
> Andrew,
>
> I tried with a freshly installed 1.4.3 but I can't reproduce your
> issue. I tried with the 1.5 and the trunk and all complete your code
> without errors. Not even valgrind found anything to complain about ...
>
> george.
>
> On May 25, 2011, at 08:22 , Andrew Senin wrote:
>
>> Sorry. I'm using OpenMPI 1.4.3.
>>
>> Thanks,
>> -Andrew
>>
>>> -----Original Message-----
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Peter Kjellstrom
>>> Sent: Wednesday, May 25, 2011 4:19 PM
>>> To: us...@open-mpi.org
>>> Subject: Re: [OMPI users] MPI_Allgather with derived type crash
>>>
>>> On Wednesday, May 25, 2011 01:16:04 PM Andrew Senin wrote:
>>>> Hello list,
>>>>
>>>> I have an application which uses MPI_Allgather with derived types. It
>>>> works correctly with mpich2 and mvapich2. However it crashes
>>>> periodically with openmpi2.
>>>
>>> Which version of OpenMPI are you using? There is no such thing as
>>> openmpi2...
>>>
>>> /Peter
Re: [OMPI users] MPI_Allgather with derived type crash
On Wednesday, May 25, 2011 01:16:04 PM Andrew Senin wrote:
> Hello list,
>
> I have an application which uses MPI_Allgather with derived types. It works
> correctly with mpich2 and mvapich2. However it crashes periodically with
> openmpi2. After investigation I found that the crash takes place when I use
> derived datatypes with MPI_Allgather and the number of ranks is greater
> than 8.

Would 8 happen to be the number of cores you have per node so what we're seeing is: single node OK, multi node FAIL?

If so what kind of inter node network are you (trying to) use(ing)?

/Peter
Re: [OMPI users] MPI_Allgather with derived type crash
Andrew,

I tried with a freshly installed 1.4.3 but I can't reproduce your issue. I tried with the 1.5 and the trunk and all complete your code without errors. Not even valgrind found anything to complain about ...

  george.

On May 25, 2011, at 08:22 , Andrew Senin wrote:

> Sorry. I'm using OpenMPI 1.4.3.
>
> Thanks,
> -Andrew
>
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Peter Kjellstrom
>> Sent: Wednesday, May 25, 2011 4:19 PM
>> To: us...@open-mpi.org
>> Subject: Re: [OMPI users] MPI_Allgather with derived type crash
>>
>> On Wednesday, May 25, 2011 01:16:04 PM Andrew Senin wrote:
>>> Hello list,
>>>
>>> I have an application which uses MPI_Allgather with derived types. It
>>> works correctly with mpich2 and mvapich2. However it crashes
>>> periodically with openmpi2.
>>
>> Which version of OpenMPI are you using? There is no such thing as
>> openmpi2...
>>
>> /Peter

George Bosilca
Research Assistant Professor
Innovative Computing Laboratory
Department of Electrical Engineering and Computer Science
University of Tennessee, Knoxville
http://web.eecs.utk.edu/~bosilca/
Re: [OMPI users] MPI_Allgather with derived type crash
Sorry. I'm using OpenMPI 1.4.3.

Thanks,
-Andrew

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Peter Kjellstrom
> Sent: Wednesday, May 25, 2011 4:19 PM
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] MPI_Allgather with derived type crash
>
> On Wednesday, May 25, 2011 01:16:04 PM Andrew Senin wrote:
>> Hello list,
>>
>> I have an application which uses MPI_Allgather with derived types. It
>> works correctly with mpich2 and mvapich2. However it crashes
>> periodically with openmpi2.
>
> Which version of OpenMPI are you using? There is no such thing as
> openmpi2...
>
> /Peter
Re: [OMPI users] MPI_Allgather with derived type crash
On Wednesday, May 25, 2011 01:16:04 PM Andrew Senin wrote:
> Hello list,
>
> I have an application which uses MPI_Allgather with derived types. It works
> correctly with mpich2 and mvapich2. However it crashes periodically with
> openmpi2.

Which version of OpenMPI are you using? There is no such thing as openmpi2...

/Peter