Re: [OMPI users] MPI_Allgather with derived type crash

2011-05-25 Thread Jeff Squyres
George --

When I run 10 copies on the same node with btl tcp,self (no sm or openib), 
valgrind reports the following to me (using ompi-1.4 branch HEAD):

==23753== Invalid write of size 1
==23753==    at 0x4C6EA31: non_overlap_copy_content_same_ddt (dt_copy.h:170)
==23753==    by 0x4C6CC3B: ompi_ddt_copy_content_same_ddt (dt_copy.c:95)
==23753==    by 0xE873C82: ompi_coll_tuned_allgather_intra_bruck (coll_tuned_allgather.c:186)
==23753==    by 0xE86B9FC: ompi_coll_tuned_allgather_intra_dec_fixed (coll_tuned_decision_fixed.c:561)
==23753==    by 0x4C7D104: PMPI_Allgather (pallgather.c:114)
==23753==    by 0x400DC0: main (andrew.c:58)
==23753==  Address 0xeef4832 is 6 bytes after a block of size 124 alloc'd
==23753==    at 0x4A05793: calloc (vg_replace_malloc.c:467)
==23753==    by 0x51FBCAB: opal_calloc (malloc.c:131)
==23753==    by 0xE873C22: ompi_coll_tuned_allgather_intra_bruck (coll_tuned_allgather.c:177)
==23753==    by 0xE86B9FC: ompi_coll_tuned_allgather_intra_dec_fixed (coll_tuned_decision_fixed.c:561)
==23753==    by 0x4C7D104: PMPI_Allgather (pallgather.c:114)
==23753==    by 0x400DC0: main (andrew.c:58)

I get lots of warnings like these (maybe 2 per process?).  In valgrind, it 
eventually completes, but without valgrind it definitely crashes.

I've filed #2805 with this issue:

https://svn.open-mpi.org/trac/ompi/ticket/2805
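
For readers following the trace: both the invalid write (coll_tuned_allgather.c:186) and the calloc of the overrun block (coll_tuned_allgather.c:177) sit inside ompi_coll_tuned_allgather_intra_bruck, i.e. the Bruck variant of allgather. The sketch below is not Open MPI's code. It is a single-process simulation of what a Bruck-style allgather does with one int per rank, assuming the textbook form of the algorithm: a logarithmic number of doubling rounds followed by a local rotation. The names bruck_rounds and bruck_allgather_sim are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Rounds needed for p ranks: the distance doubles each round,
 * so this is ceil(log2(p)). Hypothetical helper, not an Open MPI call. */
int bruck_rounds(int p)
{
    int rounds = 0;
    for (int dist = 1; dist < p; dist *= 2) {
        rounds++;
    }
    return rounds;
}

/* Simulate a Bruck-style allgather for p ranks inside one process.
 * Rank r contributes the single value r. On return, out[r * p + i] is
 * what rank r holds in slot i. Returns 0 on success, -1 on bad
 * arguments or allocation failure. */
int bruck_allgather_sim(int p, int *out)
{
    if (p <= 0 || out == NULL) {
        return -1;
    }

    size_t n = (size_t)p * (size_t)p;
    int *cur = calloc(n, sizeof *cur);   /* one row of p slots per rank */
    int *next = calloc(n, sizeof *next);
    if (cur == NULL || next == NULL) {
        free(cur);
        free(next);
        return -1;
    }

    /* Every rank starts with its own block in slot 0. */
    for (int r = 0; r < p; r++) {
        cur[(size_t)r * p] = r;
    }

    /* Doubling rounds: rank r receives the first `count` blocks held by
     * rank (r + dist) % p and appends them starting at slot `dist`. */
    for (int dist = 1; dist < p; dist *= 2) {
        int count = (dist < p - dist) ? dist : p - dist;
        assert(count > 0);

        memcpy(next, cur, n * sizeof *cur);
        for (int r = 0; r < p; r++) {
            int from = (r + dist) % p;
            memcpy(&next[(size_t)r * p + dist], &cur[(size_t)from * p],
                   (size_t)count * sizeof *cur);
        }

        int *tmp = cur;
        cur = next;
        next = tmp;
    }

    /* Rank r now holds blocks r, r+1, ... (mod p) in slots 0..p-1.
     * Rotate so that block i ends up in slot i. */
    for (int r = 0; r < p; r++) {
        for (int i = 0; i < p; i++) {
            out[(size_t)r * p + (r + i) % p] = cur[(size_t)r * p + i];
        }
    }

    free(cur);
    free(next);
    return 0;
}
```

Calling bruck_allgather_sim(9, out) with an int out[81] fills every row with 0..8, and bruck_rounds(9) is 4 versus 3 for 8 ranks. The final rotation is the step where an in-place implementation needs a temporary buffer, which is presumably what the calloc in the trace is for. The sketch sidesteps that by writing into a separate output array, so it says nothing about how such a buffer should be sized for a derived datatype.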




On May 25, 2011, at 7:16 AM, Andrew Senin wrote:

> Hello list,
>  
> I have an application which uses MPI_Allgather with derived types. It works 
> correctly with mpich2 and mvapich2. However, it crashes periodically with 
> openmpi2. After investigation I found that the crash takes place when I use 
> derived datatypes with MPI_Allgather and the number of ranks is greater than 
> 8. I've written a simple application which demonstrates the crash. It simply 
> calls MPI_Allgather with a derived datatype that consists of 1 shifted 
> integer. The sample works correctly with 2-8 ranks. But when the number of 
> ranks is greater than 8, it crashes with a segmentation fault inside the 
> MPI_Type_free, MPI_Allgather or MPI_Type_create_struct functions. This sample 
> also works correctly with mv2 and mpich2 with any number of ranks. Is this a 
> limitation of ompi allgather?
>  
> Crashed output:
> Press any key...
> Press any key...
> Press any key...
> Press any key...
> Press any key...
> Press any key...
> Press any key...
> Press any key...
> Press any key...
> [amd1:24260] *** Process received signal ***
> [amd1:24260] Signal: Segmentation fault (11)
> [amd1:24260] Signal code: Address not mapped (1)
> [amd1:24260] Failing at address: 0x18
> [amd1:24262] *** Process received signal ***
> [amd1:24262] Signal: Segmentation fault (11)
> [amd1:24262] Signal code: Address not mapped (1)
> [amd1:24262] Failing at address: 0x18
> [amd1:24258] *** Process received signal ***
> [amd1:24258] Signal: Segmentation fault (11)
> [amd1:24258] Signal code: Address not mapped (1)
> [amd1:24258] Failing at address: 0x18
> [amd1:24260] [ 0] /lib64/libpthread.so.0 [0x3d6b20eb10]
> [amd1:24260] [ 1] /lib64/libc.so.6 [0x3d6a671d80]
> [amd1:24260] [ 2] /lib64/libc.so.6(cfree+0x4b) [0x3d6a67276b]
> [amd1:24260] [ 3] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libopen-pal.so.0(opal_free+0x4e) [0x2ae52f5836bd]
> [amd1:24260] [ 4] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libmpi.so.1 [0x2ae52efd05aa]
> [amd1:24260] [ 5] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libmpi.so.1 [0x2ae52efd1e20]
> [amd1:24260] [ 6] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libmpi.so.1(ompi_ddt_destroy+0xe3) [0x2ae52efd1d7b]
> [amd1:24260] [ 7] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libmpi.so.1(MPI_Type_free+0xf0) [0x2ae52f0202ec]
> [amd1:24260] [ 8] ./gather_openmpi_153(main+0xef) [0x400dc8]
> [amd1:24260] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3d6a61d994]
> [amd1:24260] [10] ./gather_openmpi_153 [0x400ba9]
> [amd1:24260] *** End of error message ***
> [amd1:24262] [ 0] /lib64/libpthread.so.0 [0x3d6b20eb10]
> [amd1:24262] [ 1] /lib64/libc.so.6 [0x3d6a671d80]
> [amd1:24262] [ 2] /lib64/libc.so.6(cfree+0x4b) [0x3d6a67276b]
> [amd1:24262] [ 3] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libopen-pal.so.0(opal_free+0x4e) [0x2aedeea596bd]
> [amd1:24262] [ 4] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libmpi.so.1 [0x2aedee4a65aa]
> [amd1:24262] [ 5] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libmpi.so.1 [0x2aedee4a7e20]
> [amd1:24262] [ 6] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libmpi.so.1(ompi_ddt_destroy+0xe3) [0x2aedee4a7d7b]
> [amd1:24262] [ 7] /hpc/home/USERS/senina/projects/openmpi-1.4.3/install/lib/libmpi.so.1(MPI_Type_free+0xf0) [0x2aedee4f62ec]
> [amd1:24262] [ 8] ./gather_openmpi_153(main+0xef) [0x400dc8]
> [amd1:24262] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3d6a61d994]
> [amd1:24262] [10] 

Re: [OMPI users] MPI_Allgather with derived type crash

2011-05-25 Thread Andrew Senin
I've tried on my home Ubuntu 10.04, 64-bit version. It crashes with 5-7 ranks
and with 9 or more. I simply downloaded version 1.4.3 (
http://www.open-mpi.org/software/ompi/v1.4/downloads/openmpi-1.4.3.tar.gz):

- configure --prefix=`pwd`/install && make install
- cd ~/projects/gather
- ~/projects/distribs/openmpi-1.4.3/install/bin/mpicc -o gather ./gather.c
- ~/projects/distribs/openmpi-1.4.3/install/bin/mpirun -n 9 ./gather
- crash!

-Andrew


On Wed, May 25, 2011 at 10:48 PM, Andrew Senin wrote:

> Not exactly. I have 16-core nodes. Even if I run all 9 ranks on the same
> node it fails (with --mca btl sm,self). I also tried running on different
> nodes (3 nodes, 3 ranks on each node) with openib and tcp, with the same
> effect. Also, as I wrote in another message, I could see this effect on vbox
> with CentOS 5.3 (1 core on guest, 4 cores on host, no network). So possibly
> this is something OS specific? Will try on Ubuntu and share the results.
>
> Regards,
> Andrew
>


Re: [OMPI users] MPI_Allgather with derived type crash

2011-05-25 Thread George Bosilca
Andrew,

I have 8 octo-core nodes running Caos NSA release 1.0.29 (Cato) 
2009.11.13, connected with IB. I ran your test with one process per core, with 
different distributions, and all gave the same result.

  george.

On May 25, 2011, at 14:35 , Andrew Senin wrote:

> Hi George, 
> 
> Thanks a lot for your attempt! Possibly this is something OS specific? I'm
> using CentOS release 5.4 x86_64 on the cluster. I also tried it on my
> virtual box with CentOS 5.3 x86_64 (ompi 1.4.3). The same effect. On what OS
> did you try? If it helps I can upload the virtual box image on my web site.
> 
> -Andrew
> 
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>> Behalf Of George Bosilca
>> Sent: Wednesday, May 25, 2011 8:47 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] MPI_Allgather with derived type crash
>> 
>> Andrew,
>> 
>> I tried with a freshly installed 1.4.3 but I can't reproduce your
>> issue. I tried with the 1.5 and the trunk and all complete your code
>> without errors. Not even valgrind found anything to complain about ...
>> 
>>  george.
>> 
>> 
>> On May 25, 2011, at 08:22 , Andrew Senin wrote:
>> 
>>> Sorry. I'm using OpenMPI 1.4.3.
>>> 
>>> Thanks,
>>> -Andrew
>>> 
>>>> -----Original Message-----
>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>>> Behalf Of Peter Kjellstrom
>>>> Sent: Wednesday, May 25, 2011 4:19 PM
>>>> To: us...@open-mpi.org
>>>> Subject: Re: [OMPI users] MPI_Allgather with derived type crash
>>>> 
>>>> On Wednesday, May 25, 2011 01:16:04 PM Andrew Senin wrote:
>>>>> Hello list,
>>>>> 
>>>>> I have an application which uses MPI_Allgather with derived types. It
>>>>> works correctly with mpich2 and mvapich2. However it crashes
>>>>> periodically with openmpi2.
>>>> 
>>>> Which version of OpenMPI are you using? There is no such thing as
>>>> openmpi2...
>>>> 
>>>> /Peter
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> George Bosilca
>> Research Assistant Professor
>> Innovative Computing Laboratory
>> Department of Electrical Engineering and Computer Science
>> University of Tennessee, Knoxville
>> http://web.eecs.utk.edu/~bosilca/
>> 
>> 

George Bosilca
Research Assistant Professor
Innovative Computing Laboratory
Department of Electrical Engineering and Computer Science
University of Tennessee, Knoxville
http://web.eecs.utk.edu/~bosilca/




Re: [OMPI users] MPI_Allgather with derived type crash

2011-05-25 Thread Andrew Senin
Not exactly. I have 16-core nodes. Even if I run all 9 ranks on the same
node it fails (with --mca btl sm,self). I also tried running on different
nodes (3 nodes, 3 ranks on each node) with openib and tcp, with the same
effect. Also, as I wrote in another message, I could see this effect on vbox
with CentOS 5.3 (1 core on guest, 4 cores on host, no network). So possibly
this is something OS specific? Will try on Ubuntu and share the results.

Regards, 
Andrew  

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Peter Kjellstrom
> Sent: Wednesday, May 25, 2011 9:03 PM
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] MPI_Allgather with derived type crash
> 
> Would 8 happen to be the number of cores you have per node so what
> we're seeing is: single node OK, multi node FAIL?
> 
> If so what kind of inter node network are you (trying to) use(ing)?
> 
> /Peter



Re: [OMPI users] MPI_Allgather with derived type crash

2011-05-25 Thread Andrew Senin
Hi George, 

Thanks a lot for your attempt! Possibly this is something OS specific? I'm
using CentOS release 5.4 x86_64 on the cluster. I also tried it on my
virtual box with CentOS 5.3 x86_64 (ompi 1.4.3). The same effect. On what OS
did you try? If it helps I can upload the virtual box image on my web site.

-Andrew

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of George Bosilca
> Sent: Wednesday, May 25, 2011 8:47 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] MPI_Allgather with derived type crash
> 
> Andrew,
> 
> I tried with a freshly installed 1.4.3 but I can't reproduce your
> issue. I tried with the 1.5 and the trunk and all complete your code
> without errors. Not even valgrind found anything to complain about ...
> 
>   george.
> 
> 
> On May 25, 2011, at 08:22 , Andrew Senin wrote:
> 
> > Sorry. I'm using OpenMPI 1.4.3.
> >
> > Thanks,
> > -Andrew
> >
> >> -----Original Message-----
> >> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> >> Behalf Of Peter Kjellstrom
> >> Sent: Wednesday, May 25, 2011 4:19 PM
> >> To: us...@open-mpi.org
> >> Subject: Re: [OMPI users] MPI_Allgather with derived type crash
> >>
> >> On Wednesday, May 25, 2011 01:16:04 PM Andrew Senin wrote:
> >>> Hello list,
> >>>
> >>> I have an application which uses MPI_Allgather with derived types. It
> >>> works correctly with mpich2 and mvapich2. However it crashes
> >>> periodically with openmpi2.
> >>
> >> Which version of OpenMPI are you using? There is no such thing as
> >> openmpi2...
> >>
> >> /Peter
> >
> 
> George Bosilca
> Research Assistant Professor
> Innovative Computing Laboratory
> Department of Electrical Engineering and Computer Science
> University of Tennessee, Knoxville
> http://web.eecs.utk.edu/~bosilca/
> 
> 



Re: [OMPI users] MPI_Allgather with derived type crash

2011-05-25 Thread Peter Kjellström
On Wednesday, May 25, 2011 01:16:04 PM Andrew Senin wrote:
> Hello list,
> 
> I have an application which uses MPI_Allgather with derived types. It works
> correctly with mpich2 and mvapich2. However it crashes periodically with
> openmpi2. After investigation I found that the crash takes place when I use
> derived datatypes with MPI_Allgather and the number of ranks is greater than 8.

Would 8 happen to be the number of cores you have per node so what we're 
seeing is: single node OK, multi node FAIL?

If so what kind of inter node network are you (trying to) use(ing)?

/Peter




Re: [OMPI users] MPI_Allgather with derived type crash

2011-05-25 Thread George Bosilca
Andrew,

I tried with a freshly installed 1.4.3 but I can't reproduce your issue. I 
also tried with 1.5 and the trunk, and all complete your code without errors. 
Not even valgrind found anything to complain about ...

  george.


On May 25, 2011, at 08:22 , Andrew Senin wrote:

> Sorry. I'm using OpenMPI 1.4.3.
> 
> Thanks,
> -Andrew
> 
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>> Behalf Of Peter Kjellstrom
>> Sent: Wednesday, May 25, 2011 4:19 PM
>> To: us...@open-mpi.org
>> Subject: Re: [OMPI users] MPI_Allgather with derived type crash
>> 
>> On Wednesday, May 25, 2011 01:16:04 PM Andrew Senin wrote:
>>> Hello list,
>>> 
>>> I have an application which uses MPI_Allgather with derived types. It
>>> works correctly with mpich2 and mvapich2. However it crashes
>>> periodically with openmpi2.
>> 
>> Which version of OpenMPI are you using? There is no such thing as
>> openmpi2...
>> 
>> /Peter
> 

George Bosilca
Research Assistant Professor
Innovative Computing Laboratory
Department of Electrical Engineering and Computer Science
University of Tennessee, Knoxville
http://web.eecs.utk.edu/~bosilca/




Re: [OMPI users] MPI_Allgather with derived type crash

2011-05-25 Thread Andrew Senin
Sorry. I'm using OpenMPI 1.4.3.

Thanks,
-Andrew

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Peter Kjellstrom
> Sent: Wednesday, May 25, 2011 4:19 PM
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] MPI_Allgather with derived type crash
> 
> On Wednesday, May 25, 2011 01:16:04 PM Andrew Senin wrote:
> > Hello list,
> >
> > I have an application which uses MPI_Allgather with derived types. It
> > works correctly with mpich2 and mvapich2. However it crashes
> > periodically with openmpi2.
> 
> Which version of OpenMPI are you using? There is no such thing as
> openmpi2...
> 
> /Peter



Re: [OMPI users] MPI_Allgather with derived type crash

2011-05-25 Thread Peter Kjellström
On Wednesday, May 25, 2011 01:16:04 PM Andrew Senin wrote:
> Hello list,
> 
> I have an application which uses MPI_Allgather with derived types. It works
> correctly with mpich2 and mvapich2. However it crashes periodically with
> openmpi2.

Which version of OpenMPI are you using? There is no such thing as openmpi2...

/Peter

