Re: [OMPI users] OpenMPI 1.8.4rc3, 1.6.5 and 1.6.3: segmentation violation in mca_io_romio_dist_MPI_File_close

2015-01-16 Thread Eric Chamberland


On 01/14/2015 05:57 PM, Rob Latham wrote:



On 12/17/2014 07:04 PM, Eric Chamberland wrote:

Hi!

Here is a "poor man's fix" that works for me (the idea is not from me,
thanks to Thomas H.):

#1- char* lCwd = getcwd(0,0);
#2- chdir(lPathToFile);
#3- MPI_File_open(...,lFileNameWithoutTooLongPath,...);
#4- chdir(lCwd);
#5- ...

I think there are some limitations but it works very well for our
uses... and until a "real" fix is proposed...


Thanks for the bug report and test cases.  I just pushed two fixes for
master that fix the problem you were seeing:

http://git.mpich.org/mpich.git/commit/ed39c901
http://git.mpich.org/mpich.git/commit/a30a4721a2

==rob



Great!  Thank you for the follow-up (and both messages)!

Eric



Re: [OMPI users] OpenMPI 1.8.4rc3, 1.6.5 and 1.6.3: segmentation violation in mca_io_romio_dist_MPI_File_close

2015-01-14 Thread Rob Latham



On 12/17/2014 07:04 PM, Eric Chamberland wrote:

Hi!

Here is a "poor man's fix" that works for me (the idea is not from me,
thanks to Thomas H.):

#1- char* lCwd = getcwd(0,0);
#2- chdir(lPathToFile);
#3- MPI_File_open(...,lFileNameWithoutTooLongPath,...);
#4- chdir(lCwd);
#5- ...

I think there are some limitations but it works very well for our
uses... and until a "real" fix is proposed...


Thanks for the bug report and test cases.  I just pushed two fixes for 
master that fix the problem you were seeing:


http://git.mpich.org/mpich.git/commit/ed39c901
http://git.mpich.org/mpich.git/commit/a30a4721a2

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


Re: [OMPI users] OpenMPI 1.8.4rc3, 1.6.5 and 1.6.3: segmentation violation in mca_io_romio_dist_MPI_File_close

2015-01-12 Thread Rob Latham



On 12/17/2014 07:04 PM, Eric Chamberland wrote:

Hi!

Here is a "poor man's fix" that works for me (the idea is not from me,
thanks to Thomas H.):

#1- char* lCwd = getcwd(0,0);
#2- chdir(lPathToFile);
#3- MPI_File_open(...,lFileNameWithoutTooLongPath,...);
#4- chdir(lCwd);
#5- ...

I think there are some limitations but it works very well for our
uses... and until a "real" fix is proposed...


A bit of a delay on my part due to the winter break, but I have returned 
to this topic.


I have an approach that will at least tell you something went wrong in 
processing the shared file pointer name: the string is so long it 
truncates the error message, but it leaves enough to tell you what went 
wrong.


ERROR Returned by MPI: 1006695702
ERROR_string Returned by MPI: Invalid file name, error stack:
ADIOI_Shfp_fname(60): Pathname 
this/is/a_very/long/path/that/contains/a/not/so/long/filename

/but/trying/to/collectively/mpi_file_open/it/you/will/have/a/memory/corruption/resulting/of/
invalide/writing/or/reading/past/the/end/of/one/or/some/hidden/strings/in/mpio/Simpimple/use
r


At least you get "Invalid file name".

Furthermore, I'm changing that code to use PATH_MAX, not 256, which 
would have fixed the specific problem you encountered (and might have 
been sufficient to get us 10 more years, at which point someone might 
try to create a file with 1000 characters in it).


==rob



Thanks for helping!

Eric


On 12/15/2014 11:42 PM, Gilles Gouaillardet wrote:

Eric and all,

That is clearly a limitation in romio, and this is being tracked at
https://trac.mpich.org/projects/mpich/ticket/2212

In the meantime, what we can do in OpenMPI is update
mca_io_romio_file_open() and fail with a user-friendly error message
if strlen(filename) is larger than 225.

Cheers,

Gilles

On 2014/12/16 12:43, Gilles Gouaillardet wrote:

Eric,

Thanks for the simple test program.

I think I see what is going wrong, and I will make some changes to avoid
the memory overflow.

That being said, there is a hard-coded limit of 256 characters, and your
path is bigger than 300 characters.
Bottom line: even if there is no more memory overflow, that cannot
work as expected.

I will report this to the MPICH folks, since ROMIO is currently imported
from MPICH.

Cheers,

Gilles

On 2014/12/16 0:16, Eric Chamberland wrote:

Hi Gilles,

I just created a very simple test case!

With this setup, you will see the bug with valgrind:

export
too_long=./this/is/a_very/long/path/that/contains/a/not/so/long/filename/but/trying/to/collectively/mpi_file_open/it/you/will/have/a/memory/corruption/resulting/of/invalide/writing/or/reading/past/the/end/of/one/or/some/hidden/strings/in/mpio/Simple/user/would/like/to/have/the/parameter/checked/and/an/error/returned/or/this/limit/removed


mpicc -o bug_MPI_File_open_path_too_long
bug_MPI_File_open_path_too_long.c

mkdir -p $too_long
echo "header of a text file" > $too_long/toto.txt

mpirun -np 2 valgrind ./bug_MPI_File_open_path_too_long
$too_long/toto.txt

and watch the errors!

Unfortunately, the memory corruption here doesn't seem to segfault
this simple test case, but in my real case it is fatal, and with valgrind
it is reported...

OpenMPI 1.6.5 and 1.8.3rc3 are affected.

MPICH-3.1.3 also has the error!

thanks,

Eric


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2014/12/26005.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2014/12/26006.php




--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


Re: [OMPI users] OpenMPI 1.8.4rc3, 1.6.5 and 1.6.3: segmentation violation in mca_io_romio_dist_MPI_File_close

2014-12-17 Thread Eric Chamberland

Hi!

Here is a "poor man's fix" that works for me (the idea is not from me, 
thanks to Thomas H.):


#1- char* lCwd = getcwd(0,0);
#2- chdir(lPathToFile);
#3- MPI_File_open(...,lFileNameWithoutTooLongPath,...);
#4- chdir(lCwd);
#5- ...

I think there are some limitations but it works very well for our 
uses... and until a "real" fix is proposed...


Thanks for helping!

Eric


On 12/15/2014 11:42 PM, Gilles Gouaillardet wrote:

Eric and all,

That is clearly a limitation in romio, and this is being tracked at
https://trac.mpich.org/projects/mpich/ticket/2212

In the meantime, what we can do in OpenMPI is update
mca_io_romio_file_open() and fail with a user-friendly error message
if strlen(filename) is larger than 225.

Cheers,

Gilles

On 2014/12/16 12:43, Gilles Gouaillardet wrote:

Eric,

Thanks for the simple test program.

I think I see what is going wrong, and I will make some changes to avoid
the memory overflow.

That being said, there is a hard-coded limit of 256 characters, and your
path is bigger than 300 characters.
Bottom line: even if there is no more memory overflow, that cannot
work as expected.

I will report this to the MPICH folks, since ROMIO is currently imported
from MPICH.

Cheers,

Gilles

On 2014/12/16 0:16, Eric Chamberland wrote:

Hi Gilles,

I just created a very simple test case!

With this setup, you will see the bug with valgrind:

export
too_long=./this/is/a_very/long/path/that/contains/a/not/so/long/filename/but/trying/to/collectively/mpi_file_open/it/you/will/have/a/memory/corruption/resulting/of/invalide/writing/or/reading/past/the/end/of/one/or/some/hidden/strings/in/mpio/Simple/user/would/like/to/have/the/parameter/checked/and/an/error/returned/or/this/limit/removed

mpicc -o bug_MPI_File_open_path_too_long
bug_MPI_File_open_path_too_long.c

mkdir -p $too_long
echo "header of a text file" > $too_long/toto.txt

mpirun -np 2 valgrind ./bug_MPI_File_open_path_too_long
$too_long/toto.txt

and watch the errors!

Unfortunately, the memory corruption here doesn't seem to segfault
this simple test case, but in my real case it is fatal, and with valgrind
it is reported...

OpenMPI 1.6.5 and 1.8.3rc3 are affected.

MPICH-3.1.3 also has the error!

thanks,

Eric


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/12/26005.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/12/26006.php




Re: [OMPI users] OpenMPI 1.8.4rc3, 1.6.5 and 1.6.3: segmentation violation in mca_io_romio_dist_MPI_File_close

2014-12-15 Thread Gilles Gouaillardet
Eric and all,

That is clearly a limitation in romio, and this is being tracked at
https://trac.mpich.org/projects/mpich/ticket/2212

In the meantime, what we can do in OpenMPI is update
mca_io_romio_file_open() and fail with a user-friendly error message
if strlen(filename) is larger than 225.
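A sketch of the kind of guard described above (this is not the actual
OpenMPI code; the 225 limit and the error class used here are assumptions
taken from this thread):

#include <mpi.h>
#include <string.h>

#define ROMIO_SAFE_FILENAME_MAX 225   /* limit mentioned in this thread */

/* Hypothetical check to run before handing the name to ROMIO, so an
 * over-long filename yields a clear error instead of a memory overflow. */
static int check_romio_filename(const char *pFileName)
{
    if (pFileName == NULL || strlen(pFileName) > ROMIO_SAFE_FILENAME_MAX) {
        return MPI_ERR_BAD_FILE;   /* "Invalid file name" error class */
    }
    return MPI_SUCCESS;
}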

Cheers,

Gilles

On 2014/12/16 12:43, Gilles Gouaillardet wrote:
> Eric,
>
> Thanks for the simple test program.
>
> I think I see what is going wrong, and I will make some changes to avoid
> the memory overflow.
>
> That being said, there is a hard-coded limit of 256 characters, and your
> path is bigger than 300 characters.
> Bottom line: even if there is no more memory overflow, that cannot
> work as expected.
>
> I will report this to the MPICH folks, since ROMIO is currently imported
> from MPICH.
>
> Cheers,
>
> Gilles
>
> On 2014/12/16 0:16, Eric Chamberland wrote:
>> Hi Gilles,
>>
>> I just created a very simple test case!
>>
>> With this setup, you will see the bug with valgrind:
>>
>> export
>> too_long=./this/is/a_very/long/path/that/contains/a/not/so/long/filename/but/trying/to/collectively/mpi_file_open/it/you/will/have/a/memory/corruption/resulting/of/invalide/writing/or/reading/past/the/end/of/one/or/some/hidden/strings/in/mpio/Simple/user/would/like/to/have/the/parameter/checked/and/an/error/returned/or/this/limit/removed
>>
>> mpicc -o bug_MPI_File_open_path_too_long
>> bug_MPI_File_open_path_too_long.c
>>
>> mkdir -p $too_long
>> echo "header of a text file" > $too_long/toto.txt
>>
>> mpirun -np 2 valgrind ./bug_MPI_File_open_path_too_long 
>> $too_long/toto.txt
>>
>> and watch the errors!
>>
>> Unfortunately, the memory corruption here doesn't seem to segfault
>> this simple test case, but in my real case it is fatal, and with valgrind
>> it is reported...
>>
>> OpenMPI 1.6.5 and 1.8.3rc3 are affected.
>>
>> MPICH-3.1.3 also has the error!
>>
>> thanks,
>>
>> Eric
>>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/12/26005.php



Re: [OMPI users] OpenMPI 1.8.4rc3, 1.6.5 and 1.6.3: segmentation violation in mca_io_romio_dist_MPI_File_close

2014-12-15 Thread Gilles Gouaillardet
Eric,

Thanks for the simple test program.

I think I see what is going wrong, and I will make some changes to avoid
the memory overflow.

That being said, there is a hard-coded limit of 256 characters, and your
path is bigger than 300 characters.
Bottom line: even if there is no more memory overflow, that cannot
work as expected.

I will report this to the MPICH folks, since ROMIO is currently imported
from MPICH.

Cheers,

Gilles

On 2014/12/16 0:16, Eric Chamberland wrote:
> Hi Gilles,
>
> I just created a very simple test case!
>
> With this setup, you will see the bug with valgrind:
>
> export
> too_long=./this/is/a_very/long/path/that/contains/a/not/so/long/filename/but/trying/to/collectively/mpi_file_open/it/you/will/have/a/memory/corruption/resulting/of/invalide/writing/or/reading/past/the/end/of/one/or/some/hidden/strings/in/mpio/Simple/user/would/like/to/have/the/parameter/checked/and/an/error/returned/or/this/limit/removed
>
> mpicc -o bug_MPI_File_open_path_too_long
> bug_MPI_File_open_path_too_long.c
>
> mkdir -p $too_long
> echo "header of a text file" > $too_long/toto.txt
>
> mpirun -np 2 valgrind ./bug_MPI_File_open_path_too_long 
> $too_long/toto.txt
>
> and watch the errors!
>
> Unfortunately, the memory corruption here doesn't seem to segfault
> this simple test case, but in my real case it is fatal, and with valgrind
> it is reported...
>
> OpenMPI 1.6.5 and 1.8.3rc3 are affected.
>
> MPICH-3.1.3 also has the error!
>
> thanks,
>
> Eric
>



Re: [OMPI users] OpenMPI 1.8.4rc3, 1.6.5 and 1.6.3: segmentation violation in mca_io_romio_dist_MPI_File_close

2014-12-15 Thread Eric Chamberland

Hi Gilles,

Here is a simple setup that makes valgrind complain now:

export 
too_long=./this/is/a_very/long/path/that/contains/a/not/so/long/filename/but/trying/to/collectively/mpi_file_open/it/you/will/have/a/memory/corruption/resulting/of/invalide/writing/or/reading/past/the/end/of/one/or/some/hidden/strings/in/mpio/Simple/user/would/like/to/have/the/parameter/checked/and/an/error/returned/or/this/limit/removed


mkdir -p $too_long
echo "hello world." > $too_long/toto.txt

mpicc -o bug_MPI_File_open_path_too_long bug_MPI_File_open_path_too_long.c

mpirun -np 2 valgrind ./bug_MPI_File_open_path_too_long  $too_long/toto.txt

and look at the valgrind errors for invalid reads/writes on ranks 0/1.

This particular simple case doesn't segfault without valgrind, but as 
reported in my real case, it does!


Thanks!

Eric

#include "mpi.h"
#include 
#include 
#include 

void abortOnError(int ierr) {
  if (ierr != MPI_SUCCESS) {
printf("ERROR Returned by MPI: %d\n",ierr);
char* lCharPtr = (char*) malloc(sizeof(char)*MPI_MAX_ERROR_STRING);
int lLongueur = 0;
MPI_Error_string(ierr,lCharPtr, );
printf("ERROR_string Returned by MPI: %s\n",lCharPtr);
free(lCharPtr);
MPI_Abort( MPI_COMM_WORLD, 1 );
  }
}

int openFileCollectivelyAndReadMyFormat(char* pFileName)
{
int lReturnValue = 0;
MPI_File lFile = 0; 

printf("Opening the file by MPI_file_open : %s\n", pFileName);

abortOnError(MPI_File_open( MPI_COMM_WORLD, pFileName, MPI_MODE_RDONLY, 
MPI_INFO_NULL,  ));

/*printf ("ierr=%d, lFile=%ld, lFile == MPI_FILE_NULL ? %d",ierr,lFile, 
lFile == MPI_FILE_NULL);*/
long int lTrois = 0;
char lCharGIS[]="123\0";

long int lOnze = 0;
char lCharVersion10[]="12345678901\0";

abortOnError(MPI_File_read_all(lFile, , 1, MPI_LONG, 
MPI_STATUS_IGNORE));
abortOnError(MPI_File_read_all(lFile,lCharGIS, 3, MPI_CHAR, 
MPI_STATUS_IGNORE));

if (3 != lTrois) {
  lReturnValue = 1;
}

if (0 == lReturnValue && 0 != strcmp(lCharGIS, "123\0")) {
  lReturnValue = 2;
}

if (lFile) {
  printf("  ...closing the file %s\n", pFileName);
  abortOnError(MPI_File_close( ));

}
  return lReturnValue;
}

int main(int argc, char *argv[])
{
char lValeur[1024];
char *lHints[] = {"cb_nodes", "striping_factor", "striping_unit"};
int flag;

MPI_Init(, );

if (2 != argc) {
  printf("ERROR: you must specify a filename to create.\n");
  MPI_Finalize();
  return 1;
}

if (strlen(argv[1]) < 256) {
  printf("ERROR: you must specify a path+filename longer than 256 to have 
the bug!.\n");
  MPI_Finalize();
  return 1;
}

int lResult = 0;
int i;
for (i=0; i<10 ; ++i) {
  lResult |= openFileCollectivelyAndReadMyFormat(argv[1]);   
}

MPI_Finalize();

return lResult;
}

Re: [OMPI users] OpenMPI 1.8.4rc3, 1.6.5 and 1.6.3: segmentation violation in mca_io_romio_dist_MPI_File_close

2014-12-14 Thread Eric Chamberland

Hi Gilles,

OK, I patched the file; without valgrind it exploded at MPI_File_close:

*** Error in 
`/pmi/cmpbib/compilation_BIB_gcc-4.5.1_64bit/COMPILE_AUTO/GIREF/bin/Test.NormesEtProjectionChamp.dev': 
free(): invalid next size (normal): 0x04b6c950 ***

=== Backtrace: =
/lib64/libc.so.6(+0x7ac56)[0x7fab5692bc56]
/lib64/libc.so.6(+0x7b9d3)[0x7fab5692c9d3]
/opt/openmpi-1.8.4rc3_debug/lib64/openmpi/mca_io_romio.so(ADIOI_Free_fn+0x5f)[0x7fab4c1b9920]
/opt/openmpi-1.8.4rc3_debug/lib64/openmpi/mca_io_romio.so(mca_io_romio_dist_MPI_File_close+0xc6)[0x7fab4c185afa]
/opt/openmpi-1.8.4rc3_debug/lib64/openmpi/mca_io_romio.so(mca_io_romio_file_close+0x2be)[0x7fab4c180e88]
/opt/openmpi-1.8.4rc3_debug/lib64/libmpi.so.1(+0x4c09c)[0x7fab574ed09c]
/opt/openmpi-1.8.4rc3_debug/lib64/libmpi.so.1(+0x4af4b)[0x7fab574ebf4b]
/opt/openmpi-1.8.4rc3_debug/lib64/libmpi.so.1(ompi_file_close+0xd7)[0x7fab574eca0d]
/opt/openmpi-1.8.4rc3_debug/lib64/libmpi.so.1(PMPI_File_close+0xc1)[0x7fab57572e62]
/pmi/cmpbib/compilation_BIB_gcc-4.5.1_64bit/COMPILE_AUTO/GIREF/lib/libgiref_dev_Champs.so(_ZN18GISLectureEcritureIdE9litGISMPIESsR13GroupeInfoSurIdERSs+0x258f)[0x7fab658bb637]
/pmi/cmpbib/compilation_BIB_gcc-4.5.1_64bit/COMPILE_AUTO/GIREF/lib/libgiref_dev_Champs.so(_ZN5Champ16importeParalleleERKSs+0x2ae)[0x7fab65898f0e]
/pmi/cmpbib/compilation_BIB_gcc-4.5.1_64bit/COMPILE_AUTO/GIREF/bin/Test.NormesEtProjectionChamp.dev[0x4d0def]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fab568d2a15]
/pmi/cmpbib/compilation_BIB_gcc-4.5.1_64bit/COMPILE_AUTO/GIREF/bin/Test.NormesEtProjectionChamp.dev[0x4b1429]


I will launch it in valgrind now...

but since it lasts 20 minutes, I will only send the result tomorrow...

anyway, thank you very much! :-)

Eric

On 12/14/2014 10:26 PM, Gilles Gouaillardet wrote:

Eric,

Here is a patch for the v1.8 series; it fixes a one-byte overflow.

Valgrind should stop complaining, and assuming this is the root cause of
the memory corruption, that could also fix your program.

That being said, shared_fp_fname is limited to 255 characters (this is
hard-coded), so even if it gets truncated to 255 characters (instead of
256), the behavior could be kind of random.

/* from ADIOI_Shfp_fname:
   If the real file is /tmp/thakur/testfile, the shared-file-pointer
   file will be /tmp/thakur/.testfile.shfp.<xxx>, where <xxx> is */

FWIW, <xxx> is a random number that takes between 1 and 10 characters.
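As a back-of-the-envelope check (assuming only the extra leading '.', the
".shfp." infix and a worst-case 10-digit suffix described above), the
256-byte buffer leaves roughly 238 characters for the original path, which
is why a ~300-character path overflows:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const size_t lBuffer = 256;                 /* hard-coded ROMIO buffer  */
    const size_t lDot    = 1;                   /* leading '.'              */
    const size_t lShfp   = strlen(".shfp.");    /* 6 characters             */
    const size_t lSuffix = 10;                  /* worst-case random suffix */
    const size_t lNul    = 1;                   /* terminating '\0'         */

    printf("max original path length: %zu\n",
           lBuffer - (lDot + lShfp + lSuffix + lNul));   /* prints 238 */
    return 0;
}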

Could you please give this patch a try and let us know the results?

Cheers,

Gilles

On 2014/12/15 11:52, Eric Chamberland wrote:

Hi again,

some new hints that might help:

1- With valgrind: If I run the same test case, same data, but
moved to a shorter path+filename, then *valgrind* does *not*
complain!!
2- Without valgrind: *Sometimes*, the test case with long
path+filename passes without "segfaulting"!
3- It seems to happen at the fourth file I try to open using the
procedure described below:

Also, I was wondering about this: In this 2 processes test case
(running on the same node), I:

1- open the file collectively (which resides on the same ssd drive on
my computer)
2-  MPI_File_read_at_all a long int and 3 chars (11 bytes)
3- stop (because I detect I am not reading my MPIIO file format)
4- close the file

A guess (FWIW): Could process rank 0, for example, close the file too
quickly, destroying the string reserved for the filename that is
used by process rank 1, which could be using shared memory on the same
node?

Thanks,

Eric

On 12/14/2014 02:06 PM, Eric Chamberland wrote:

Hi,

I finally (thanks for fixing oversubscribing) tested with 1.8.4rc3 for
my problem with collective MPI I/O.

A problem is still there.  In this 2 processes example, process rank 1
dies with a segfault while process rank 0 waits indefinitely...

Running with valgrind, I found these errors, which may give hints:

*
Rank 1:
*
On process rank 1, without valgrind, it ends with either a segmentation
violation, a memory corruption, or an invalid free.

But running with valgrind, it tells:

==16715== Invalid write of size 2
==16715==at 0x4C2E793: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:915)
==16715==by 0x1F60AA91: opal_convertor_unpack (opal_convertor.c:321)
==16715==by 0x25AA8DD3: mca_pml_ob1_recv_frag_callback_match
(pml_ob1_recvfrag.c:225)
==16715==by 0x2544110C: mca_btl_vader_check_fboxes
(btl_vader_fbox.h:220)
==16715==by 0x25443577: mca_btl_vader_component_progress
(btl_vader_component.c:695)
==16715==by 0x1F5F0F27: opal_progress (opal_progress.c:207)
==16715==by 0x1ACB40B3: opal_condition_wait (condition.h:93)
==16715==by 0x1ACB4201: ompi_request_wait_completion (request.h:381)
==16715==by 0x1ACB4305: ompi_request_default_wait (req_wait.c:39)
==16715==by 0x26BA2FFB: ompi_coll_tuned_bcast_intra_generic
(coll_tuned_bcast.c:254)
==16715==by 0x26BA36F7: 

Re: [OMPI users] OpenMPI 1.8.4rc3, 1.6.5 and 1.6.3: segmentation violation in mca_io_romio_dist_MPI_File_close

2014-12-14 Thread Eric Chamberland

On 12/14/2014 09:55 PM, Gilles Gouaillardet wrote:

Eric,

I checked the source code (v1.8), and the limit for shared_fp_fname
is 256 (hard-coded).
Oh my god!  Is it that simple?  By the way, my filename is shorter than 
256, but the whole path is:


echo 
"/pmi/cmpbib/compilation_BIB_gcc-4.5.1_64bit/COMPILE_AUTO/TestValidation/Ressources/dev/Test.NormesEtProjectionChamp/Ressources.champscalhermite2dordre5incarete_elemtri_2proc/Resultats.Etalons/champscalhermite2dordre5incarete_elemtri_2procReinterpole_UAna.U0" 
|wc -c


258 characters.

If this is it, I hope it can be fixed in some manner... or at least 
produce an error message!...


Thanks,

Eric



I am now checking if the overflow is correctly detected (that could
explain the one-byte overflow reported by valgrind).

Cheers,

Gilles

On 2014/12/15 11:52, Eric Chamberland wrote:

Hi again,

some new hints that might help:

1- With valgrind: If I run the same test case, same data, but
moved to a shorter path+filename, then *valgrind* does *not*
complain!!
2- Without valgrind: *Sometimes*, the test case with long
path+filename passes without "segfaulting"!
3- It seems to happen at the fourth file I try to open using the
procedure described below:

Also, I was wondering about this: In this 2 processes test case
(running on the same node), I:

1- open the file collectively (which resides on the same ssd drive on
my computer)
2-  MPI_File_read_at_all a long int and 3 chars (11 bytes)
3- stop (because I detect I am not reading my MPIIO file format)
4- close the file

A guess (FWIW): Could process rank 0, for example, close the file too
quickly, destroying the string reserved for the filename that is
used by process rank 1, which could be using shared memory on the same
node?

Thanks,

Eric

On 12/14/2014 02:06 PM, Eric Chamberland wrote:

Hi,

I finally (thanks for fixing oversubscribing) tested with 1.8.4rc3 for
my problem with collective MPI I/O.

A problem is still there.  In this 2 processes example, process rank 1
dies with a segfault while process rank 0 waits indefinitely...

Running with valgrind, I found these errors, which may give hints:

*
Rank 1:
*
On process rank 1, without valgrind, it ends with either a segmentation
violation, a memory corruption, or an invalid free.

But running with valgrind, it tells:

==16715== Invalid write of size 2
==16715==at 0x4C2E793: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:915)
==16715==by 0x1F60AA91: opal_convertor_unpack (opal_convertor.c:321)
==16715==by 0x25AA8DD3: mca_pml_ob1_recv_frag_callback_match
(pml_ob1_recvfrag.c:225)
==16715==by 0x2544110C: mca_btl_vader_check_fboxes
(btl_vader_fbox.h:220)
==16715==by 0x25443577: mca_btl_vader_component_progress
(btl_vader_component.c:695)
==16715==by 0x1F5F0F27: opal_progress (opal_progress.c:207)
==16715==by 0x1ACB40B3: opal_condition_wait (condition.h:93)
==16715==by 0x1ACB4201: ompi_request_wait_completion (request.h:381)
==16715==by 0x1ACB4305: ompi_request_default_wait (req_wait.c:39)
==16715==by 0x26BA2FFB: ompi_coll_tuned_bcast_intra_generic
(coll_tuned_bcast.c:254)
==16715==by 0x26BA36F7: ompi_coll_tuned_bcast_intra_binomial
(coll_tuned_bcast.c:385)
==16715==by 0x26B94289: ompi_coll_tuned_bcast_intra_dec_fixed
(coll_tuned_decision_fixed.c:258)
==16715==by 0x1ACD55F2: PMPI_Bcast (pbcast.c:110)
==16715==by 0x2FE1CC48: ADIOI_Shfp_fname (shfp_fname.c:67)
==16715==by 0x2FDEB493: mca_io_romio_dist_MPI_File_open (open.c:177)
==16715==by 0x2FDE3B0D: mca_io_romio_file_open
(io_romio_file_open.c:40)
==16715==by 0x1AD52344: module_init (io_base_file_select.c:455)
==16715==by 0x1AD51DFA: mca_io_base_file_select
(io_base_file_select.c:238)
==16715==by 0x1ACA582F: ompi_file_open (file.c:130)
==16715==by 0x1AD30DA3: PMPI_File_open (pfile_open.c:94)
==16715==by 0x13F9B36F:
PAIO::ouvreFichierMPIIO(PAGroupeProcessus&, std::string const&, int,
ompi_file_t*&, bool) (PAIO.cc:290)
==16715==by 0xCA44252:
GISLectureEcriture::litGISMPI(std::string,
GroupeInfoSur&, std::string&) (GISLectureEcriture.icc:411)
==16715==by 0xCA23F0D: Champ::importeParallele(std::string const&)
(Champ.cc:951)
==16715==by 0x4D0DEE: main (Test.NormesEtProjectionChamp.cc:789)
==16715==  Address 0x32ef3e50 is 0 bytes after a block of size 256
alloc'd
==16715==at 0x4C2C5A4: malloc (vg_replace_malloc.c:296)
==16715==by 0x2FE1C78E: ADIOI_Malloc_fn (malloc.c:50)
==16715==by 0x2FE1C951: ADIOI_Shfp_fname (shfp_fname.c:25)
==16715==by 0x2FDEB493: mca_io_romio_dist_MPI_File_open (open.c:177)
==16715==by 0x2FDE3B0D: mca_io_romio_file_open
(io_romio_file_open.c:40)
==16715==by 0x1AD52344: module_init (io_base_file_select.c:455)
==16715==by 0x1AD51DFA: mca_io_base_file_select
(io_base_file_select.c:238)
==16715==by 0x1ACA582F: ompi_file_open (file.c:130)
==16715==by 0x1AD30DA3: 

Re: [OMPI users] OpenMPI 1.8.4rc3, 1.6.5 and 1.6.3: segmentation violation in mca_io_romio_dist_MPI_File_close

2014-12-14 Thread Gilles Gouaillardet
Eric,

I checked the source code (v1.8), and the limit for shared_fp_fname
is 256 (hard-coded).

I am now checking if the overflow is correctly detected (that could
explain the one-byte overflow reported by valgrind).

Cheers,

Gilles

On 2014/12/15 11:52, Eric Chamberland wrote:
> Hi again,
>
> some new hints that might help:
>
> 1- With valgrind: If I run the same test case, same data, but
> moved to a shorter path+filename, then *valgrind* does *not*
> complain!!
> 2- Without valgrind: *Sometimes*, the test case with long
> path+filename passes without "segfaulting"!
> 3- It seems to happen at the fourth file I try to open using the
> procedure described below:
>
> Also, I was wondering about this: In this 2 processes test case
> (running on the same node), I:
>
> 1- open the file collectively (which resides on the same ssd drive on
> my computer)
> 2-  MPI_File_read_at_all a long int and 3 chars (11 bytes)
> 3- stop (because I detect I am not reading my MPIIO file format)
> 4- close the file
>
> A guess (FWIW): Could process rank 0, for example, close the file too
> quickly, destroying the string reserved for the filename that is
> used by process rank 1, which could be using shared memory on the same
> node?
>
> Thanks,
>
> Eric
>
> On 12/14/2014 02:06 PM, Eric Chamberland wrote:
>> Hi,
>>
>> I finally (thanks for fixing oversubscribing) tested with 1.8.4rc3 for
>> my problem with collective MPI I/O.
>>
>> A problem is still there.  In this 2 processes example, process rank 1
>> dies with a segfault while process rank 0 waits indefinitely...
>>
>> Running with valgrind, I found these errors, which may give hints:
>>
>> *
>> Rank 1:
>> *
>> On process rank 1, without valgrind, it ends with either a segmentation
>> violation, a memory corruption, or an invalid free.
>>
>> But running with valgrind, it tells:
>>
>> ==16715== Invalid write of size 2
>> ==16715==at 0x4C2E793: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:915)
>> ==16715==by 0x1F60AA91: opal_convertor_unpack (opal_convertor.c:321)
>> ==16715==by 0x25AA8DD3: mca_pml_ob1_recv_frag_callback_match
>> (pml_ob1_recvfrag.c:225)
>> ==16715==by 0x2544110C: mca_btl_vader_check_fboxes
>> (btl_vader_fbox.h:220)
>> ==16715==by 0x25443577: mca_btl_vader_component_progress
>> (btl_vader_component.c:695)
>> ==16715==by 0x1F5F0F27: opal_progress (opal_progress.c:207)
>> ==16715==by 0x1ACB40B3: opal_condition_wait (condition.h:93)
>> ==16715==by 0x1ACB4201: ompi_request_wait_completion (request.h:381)
>> ==16715==by 0x1ACB4305: ompi_request_default_wait (req_wait.c:39)
>> ==16715==by 0x26BA2FFB: ompi_coll_tuned_bcast_intra_generic
>> (coll_tuned_bcast.c:254)
>> ==16715==by 0x26BA36F7: ompi_coll_tuned_bcast_intra_binomial
>> (coll_tuned_bcast.c:385)
>> ==16715==by 0x26B94289: ompi_coll_tuned_bcast_intra_dec_fixed
>> (coll_tuned_decision_fixed.c:258)
>> ==16715==by 0x1ACD55F2: PMPI_Bcast (pbcast.c:110)
>> ==16715==by 0x2FE1CC48: ADIOI_Shfp_fname (shfp_fname.c:67)
>> ==16715==by 0x2FDEB493: mca_io_romio_dist_MPI_File_open (open.c:177)
>> ==16715==by 0x2FDE3B0D: mca_io_romio_file_open
>> (io_romio_file_open.c:40)
>> ==16715==by 0x1AD52344: module_init (io_base_file_select.c:455)
>> ==16715==by 0x1AD51DFA: mca_io_base_file_select
>> (io_base_file_select.c:238)
>> ==16715==by 0x1ACA582F: ompi_file_open (file.c:130)
>> ==16715==by 0x1AD30DA3: PMPI_File_open (pfile_open.c:94)
>> ==16715==by 0x13F9B36F:
>> PAIO::ouvreFichierMPIIO(PAGroupeProcessus&, std::string const&, int,
>> ompi_file_t*&, bool) (PAIO.cc:290)
>> ==16715==by 0xCA44252:
>> GISLectureEcriture::litGISMPI(std::string,
>> GroupeInfoSur&, std::string&) (GISLectureEcriture.icc:411)
>> ==16715==by 0xCA23F0D: Champ::importeParallele(std::string const&)
>> (Champ.cc:951)
>> ==16715==by 0x4D0DEE: main (Test.NormesEtProjectionChamp.cc:789)
>> ==16715==  Address 0x32ef3e50 is 0 bytes after a block of size 256
>> alloc'd
>> ==16715==at 0x4C2C5A4: malloc (vg_replace_malloc.c:296)
>> ==16715==by 0x2FE1C78E: ADIOI_Malloc_fn (malloc.c:50)
>> ==16715==by 0x2FE1C951: ADIOI_Shfp_fname (shfp_fname.c:25)
>> ==16715==by 0x2FDEB493: mca_io_romio_dist_MPI_File_open (open.c:177)
>> ==16715==by 0x2FDE3B0D: mca_io_romio_file_open
>> (io_romio_file_open.c:40)
>> ==16715==by 0x1AD52344: module_init (io_base_file_select.c:455)
>> ==16715==by 0x1AD51DFA: mca_io_base_file_select
>> (io_base_file_select.c:238)
>> ==16715==by 0x1ACA582F: ompi_file_open (file.c:130)
>> ==16715==by 0x1AD30DA3: PMPI_File_open (pfile_open.c:94)
>> ==16715==by 0x13F9B36F:
>> PAIO::ouvreFichierMPIIO(PAGroupeProcessus&, std::string const&, int,
>> ompi_file_t*&, bool) (PAIO.cc:290)
>> ==16715==by 0xCA44252:
>> GISLectureEcriture::litGISMPI(std::string,
>> GroupeInfoSur&, std::string&) 

Re: [OMPI users] OpenMPI 1.8.4rc3, 1.6.5 and 1.6.3: segmentation violation in mca_io_romio_dist_MPI_File_close

2014-12-14 Thread Eric Chamberland

Hi again,

some new hints that might help:

1- With valgrind: If I run the same test case, same data, but moved 
to a shorter path+filename, then *valgrind* does *not* complain!!
2- Without valgrind: *Sometimes*, the test case with long path+filename 
passes without "segfaulting"!
3- It seems to happen at the fourth file I try to open using the 
procedure described below:


Also, I was wondering about this: In this 2 processes test case (running 
on the same node), I:


1- open the file collectively (which resides on the same ssd drive on my 
computer)

2-  MPI_File_read_at_all a long int and 3 chars (11 bytes)
3- stop (because I detect I am not reading my MPIIO file format)
4- close the file

A guess (FWIW): Could process rank 0, for example, close the file too 
quickly, destroying the string reserved for the filename that is 
used by process rank 1, which could be using shared memory on the same node?


Thanks,

Eric

On 12/14/2014 02:06 PM, Eric Chamberland wrote:

Hi,

I finally (thanks for fixing oversubscribing) tested with 1.8.4rc3 for
my problem with collective MPI I/O.

A problem is still there.  In this 2 processes example, process rank 1
dies with a segfault while process rank 0 waits indefinitely...

Running with valgrind, I found these errors, which may give hints:

*
Rank 1:
*
On process rank 1, without valgrind, it ends with either a segmentation
violation, a memory corruption, or an invalid free.

But running with valgrind, it tells:

==16715== Invalid write of size 2
==16715==at 0x4C2E793: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:915)
==16715==by 0x1F60AA91: opal_convertor_unpack (opal_convertor.c:321)
==16715==by 0x25AA8DD3: mca_pml_ob1_recv_frag_callback_match
(pml_ob1_recvfrag.c:225)
==16715==by 0x2544110C: mca_btl_vader_check_fboxes
(btl_vader_fbox.h:220)
==16715==by 0x25443577: mca_btl_vader_component_progress
(btl_vader_component.c:695)
==16715==by 0x1F5F0F27: opal_progress (opal_progress.c:207)
==16715==by 0x1ACB40B3: opal_condition_wait (condition.h:93)
==16715==by 0x1ACB4201: ompi_request_wait_completion (request.h:381)
==16715==by 0x1ACB4305: ompi_request_default_wait (req_wait.c:39)
==16715==by 0x26BA2FFB: ompi_coll_tuned_bcast_intra_generic
(coll_tuned_bcast.c:254)
==16715==by 0x26BA36F7: ompi_coll_tuned_bcast_intra_binomial
(coll_tuned_bcast.c:385)
==16715==by 0x26B94289: ompi_coll_tuned_bcast_intra_dec_fixed
(coll_tuned_decision_fixed.c:258)
==16715==by 0x1ACD55F2: PMPI_Bcast (pbcast.c:110)
==16715==by 0x2FE1CC48: ADIOI_Shfp_fname (shfp_fname.c:67)
==16715==by 0x2FDEB493: mca_io_romio_dist_MPI_File_open (open.c:177)
==16715==by 0x2FDE3B0D: mca_io_romio_file_open
(io_romio_file_open.c:40)
==16715==by 0x1AD52344: module_init (io_base_file_select.c:455)
==16715==by 0x1AD51DFA: mca_io_base_file_select
(io_base_file_select.c:238)
==16715==by 0x1ACA582F: ompi_file_open (file.c:130)
==16715==by 0x1AD30DA3: PMPI_File_open (pfile_open.c:94)
==16715==by 0x13F9B36F:
PAIO::ouvreFichierMPIIO(PAGroupeProcessus&, std::string const&, int,
ompi_file_t*&, bool) (PAIO.cc:290)
==16715==by 0xCA44252:
GISLectureEcriture::litGISMPI(std::string,
GroupeInfoSur&, std::string&) (GISLectureEcriture.icc:411)
==16715==by 0xCA23F0D: Champ::importeParallele(std::string const&)
(Champ.cc:951)
==16715==by 0x4D0DEE: main (Test.NormesEtProjectionChamp.cc:789)
==16715==  Address 0x32ef3e50 is 0 bytes after a block of size 256
alloc'd
==16715==at 0x4C2C5A4: malloc (vg_replace_malloc.c:296)
==16715==by 0x2FE1C78E: ADIOI_Malloc_fn (malloc.c:50)
==16715==by 0x2FE1C951: ADIOI_Shfp_fname (shfp_fname.c:25)
==16715==by 0x2FDEB493: mca_io_romio_dist_MPI_File_open (open.c:177)
==16715==by 0x2FDE3B0D: mca_io_romio_file_open
(io_romio_file_open.c:40)
==16715==by 0x1AD52344: module_init (io_base_file_select.c:455)
==16715==by 0x1AD51DFA: mca_io_base_file_select
(io_base_file_select.c:238)
==16715==by 0x1ACA582F: ompi_file_open (file.c:130)
==16715==by 0x1AD30DA3: PMPI_File_open (pfile_open.c:94)
==16715==by 0x13F9B36F:
PAIO::ouvreFichierMPIIO(PAGroupeProcessus&, std::string const&, int,
ompi_file_t*&, bool) (PAIO.cc:290)
==16715==by 0xCA44252:
GISLectureEcriture::litGISMPI(std::string,
GroupeInfoSur&, std::string&) (GISLectureEcriture.icc:411)
==16715==by 0xCA23F0D: Champ::importeParallele(std::string const&)
(Champ.cc:951)
==16715==by 0x4D0DEE: main (Test.NormesEtProjectionChamp.cc:789)
...
...
==16715== Invalid write of size 1
==16715==at 0x4C2E7BB: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:915)
==16715==by 0x1F60AA91: opal_convertor_unpack (opal_convertor.c:321)
==16715==by 0x25AA8DD3: mca_pml_ob1_recv_frag_callback_match
(pml_ob1_recvfrag.c:225)
==16715==by 0x2544110C: mca_btl_vader_check_fboxes
(btl_vader_fbox.h:220)
==16715==by 

Re: [OMPI users] OpenMPI 1.8.4rc3, 1.6.5 and 1.6.3: segmentation violation in mca_io_romio_dist_MPI_File_close

2014-12-14 Thread Eric Chamberland

Hi Gilles,

On 12/14/2014 09:20 PM, Gilles Gouaillardet wrote:

Eric,

Can you make your test case (source + input file + howto) available so I
can try to reproduce and fix this?
I would like to, but the complete app is big (and not public), sits on top 
of PETSc with MKL, and is in C++... :-(


I can for sure send you binaries if you have any of the following 
platforms (RedHat 6.6, openSUSE 13.1, openSUSE 12.3, Fedora 19, 
RedHat 5.7 or openSUSE 11.3) and input files (maybe we could get it 
running in a chrooted environment? but I never tried this), but our source 
code I don't think I can... still, I would like to post a simple example 
showing the problem...





Based on the stack trace, I assume this is a complete end-user application.
Have you tried/been able to reproduce the same kind of crash with a
trimmed test program?

I am trying to do so right now... ;-)

I am trying to reproduce the exact order of MPI opens/closes of files, 
followed by a "normal" open of the file, etc...


If I can reproduce the problem, I will send it immediately to the list.  
It is an intermittent problem, but valgrind seems to catch it every time!


I will work on this this evening and in the following days, hoping to 
send it in time before the final release...




BTW, what kind of filesystem is hosting Resultats.Eta1? (e.g. ext4 /
nfs / lustre / other)


It is a local hard drive with ext4.

Oh, I just noticed that one of my mails didn't make it to the list... I 
will try to resend it now... it contains a few hints...


Thanks! :-)

Eric



Re: [OMPI users] OpenMPI 1.8.4rc3, 1.6.5 and 1.6.3: segmentation violation in mca_io_romio_dist_MPI_File_close

2014-12-14 Thread Gilles Gouaillardet
Eric,

Can you make your test case (source + input file + howto) available so I
can try to reproduce and fix this?

Based on the stack trace, I assume this is a complete end-user application.
Have you tried/been able to reproduce the same kind of crash with a
trimmed test program?

BTW, what kind of filesystem is hosting Resultats.Eta1? (e.g. ext4 /
nfs / lustre / other)

Cheers,

Gilles

On 2014/12/15 4:06, Eric Chamberland wrote:
> Hi,
>
> I finally (thanks for fixing oversubscribing) tested with 1.8.4rc3 for
> my problem with collective MPI I/O.
>
> A problem is still there.  In this 2 processes example, process rank 1
> dies with a segfault while process rank 0 waits indefinitely...
>
> Running with valgrind, I found these errors, which may give hints:
>
> *
> Rank 1:
> *
> On process rank 1, without valgrind, it ends with either a segmentation
> violation, a memory corruption, or an invalid free.
>
> But running with valgrind, it tells:
>
> ==16715== Invalid write of size 2
> ==16715==at 0x4C2E793: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:915)
> ==16715==by 0x1F60AA91: opal_convertor_unpack (opal_convertor.c:321)
> ==16715==by 0x25AA8DD3: mca_pml_ob1_recv_frag_callback_match
> (pml_ob1_recvfrag.c:225)
> ==16715==by 0x2544110C: mca_btl_vader_check_fboxes
> (btl_vader_fbox.h:220)
> ==16715==by 0x25443577: mca_btl_vader_component_progress
> (btl_vader_component.c:695)
> ==16715==by 0x1F5F0F27: opal_progress (opal_progress.c:207)
> ==16715==by 0x1ACB40B3: opal_condition_wait (condition.h:93)
> ==16715==by 0x1ACB4201: ompi_request_wait_completion (request.h:381)
> ==16715==by 0x1ACB4305: ompi_request_default_wait (req_wait.c:39)
> ==16715==by 0x26BA2FFB: ompi_coll_tuned_bcast_intra_generic
> (coll_tuned_bcast.c:254)
> ==16715==by 0x26BA36F7: ompi_coll_tuned_bcast_intra_binomial
> (coll_tuned_bcast.c:385)
> ==16715==by 0x26B94289: ompi_coll_tuned_bcast_intra_dec_fixed
> (coll_tuned_decision_fixed.c:258)
> ==16715==by 0x1ACD55F2: PMPI_Bcast (pbcast.c:110)
> ==16715==by 0x2FE1CC48: ADIOI_Shfp_fname (shfp_fname.c:67)
> ==16715==by 0x2FDEB493: mca_io_romio_dist_MPI_File_open (open.c:177)
> ==16715==by 0x2FDE3B0D: mca_io_romio_file_open
> (io_romio_file_open.c:40)
> ==16715==by 0x1AD52344: module_init (io_base_file_select.c:455)
> ==16715==by 0x1AD51DFA: mca_io_base_file_select
> (io_base_file_select.c:238)
> ==16715==by 0x1ACA582F: ompi_file_open (file.c:130)
> ==16715==by 0x1AD30DA3: PMPI_File_open (pfile_open.c:94)
> ==16715==by 0x13F9B36F:
> PAIO::ouvreFichierMPIIO(PAGroupeProcessus&, std::string const&, int,
> ompi_file_t*&, bool) (PAIO.cc:290)
> ==16715==by 0xCA44252:
> GISLectureEcriture::litGISMPI(std::string,
> GroupeInfoSur&, std::string&) (GISLectureEcriture.icc:411)
> ==16715==by 0xCA23F0D: Champ::importeParallele(std::string const&)
> (Champ.cc:951)
> ==16715==by 0x4D0DEE: main (Test.NormesEtProjectionChamp.cc:789)
> ==16715==  Address 0x32ef3e50 is 0 bytes after a block of size 256
> alloc'd
> ==16715==at 0x4C2C5A4: malloc (vg_replace_malloc.c:296)
> ==16715==by 0x2FE1C78E: ADIOI_Malloc_fn (malloc.c:50)
> ==16715==by 0x2FE1C951: ADIOI_Shfp_fname (shfp_fname.c:25)
> ==16715==by 0x2FDEB493: mca_io_romio_dist_MPI_File_open (open.c:177)
> ==16715==by 0x2FDE3B0D: mca_io_romio_file_open
> (io_romio_file_open.c:40)
> ==16715==by 0x1AD52344: module_init (io_base_file_select.c:455)
> ==16715==by 0x1AD51DFA: mca_io_base_file_select
> (io_base_file_select.c:238)
> ==16715==by 0x1ACA582F: ompi_file_open (file.c:130)
> ==16715==by 0x1AD30DA3: PMPI_File_open (pfile_open.c:94)
> ==16715==by 0x13F9B36F:
> PAIO::ouvreFichierMPIIO(PAGroupeProcessus&, std::string const&, int,
> ompi_file_t*&, bool) (PAIO.cc:290)
> ==16715==by 0xCA44252:
> GISLectureEcriture::litGISMPI(std::string,
> GroupeInfoSur&, std::string&) (GISLectureEcriture.icc:411)
> ==16715==by 0xCA23F0D: Champ::importeParallele(std::string const&)
> (Champ.cc:951)
> ==16715==by 0x4D0DEE: main (Test.NormesEtProjectionChamp.cc:789)
> ...
> ...
> ==16715== Invalid write of size 1
> ==16715==at 0x4C2E7BB: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:915)
> ==16715==by 0x1F60AA91: opal_convertor_unpack (opal_convertor.c:321)
> ==16715==by 0x25AA8DD3: mca_pml_ob1_recv_frag_callback_match
> (pml_ob1_recvfrag.c:225)
> ==16715==by 0x2544110C: mca_btl_vader_check_fboxes
> (btl_vader_fbox.h:220)
> ==16715==by 0x25443577: mca_btl_vader_component_progress
> (btl_vader_component.c:695)
> ==16715==by 0x1F5F0F27: opal_progress (opal_progress.c:207)
> ==16715==by 0x1ACB40B3: opal_condition_wait (condition.h:93)
> ==16715==by 0x1ACB4201: ompi_request_wait_completion (request.h:381)
> ==16715==by 0x1ACB4305: ompi_request_default_wait (req_wait.c:39)
> ==16715==by 0x26BA2FFB: 

Re: [OMPI users] OpenMPI 1.8.4rc3, 1.6.5 and 1.6.3: segmentation violation in mca_io_romio_dist_MPI_File_close

2014-12-14 Thread Eric Chamberland

Hi,

I finally (thanks for fixing oversubscribing) tested with 1.8.4rc3 for 
my problem with collective MPI I/O.


A problem is still there.  In this 2 processes example, process rank 1 dies 
with a segfault while process rank 0 waits indefinitely...


Running with valgrind, I found these errors, which may give hints:

*
Rank 1:
*
On process rank 1, without valgrind, it ends with either a segmentation 
violation, a memory corruption, or an invalid free.


But running with valgrind, it tells:

==16715== Invalid write of size 2
==16715==at 0x4C2E793: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:915)
==16715==by 0x1F60AA91: opal_convertor_unpack (opal_convertor.c:321)
==16715==by 0x25AA8DD3: mca_pml_ob1_recv_frag_callback_match 
(pml_ob1_recvfrag.c:225)
==16715==by 0x2544110C: mca_btl_vader_check_fboxes 
(btl_vader_fbox.h:220)
==16715==by 0x25443577: mca_btl_vader_component_progress 
(btl_vader_component.c:695)

==16715==by 0x1F5F0F27: opal_progress (opal_progress.c:207)
==16715==by 0x1ACB40B3: opal_condition_wait (condition.h:93)
==16715==by 0x1ACB4201: ompi_request_wait_completion (request.h:381)
==16715==by 0x1ACB4305: ompi_request_default_wait (req_wait.c:39)
==16715==by 0x26BA2FFB: ompi_coll_tuned_bcast_intra_generic 
(coll_tuned_bcast.c:254)
==16715==by 0x26BA36F7: ompi_coll_tuned_bcast_intra_binomial 
(coll_tuned_bcast.c:385)
==16715==by 0x26B94289: ompi_coll_tuned_bcast_intra_dec_fixed 
(coll_tuned_decision_fixed.c:258)

==16715==by 0x1ACD55F2: PMPI_Bcast (pbcast.c:110)
==16715==by 0x2FE1CC48: ADIOI_Shfp_fname (shfp_fname.c:67)
==16715==by 0x2FDEB493: mca_io_romio_dist_MPI_File_open (open.c:177)
==16715==by 0x2FDE3B0D: mca_io_romio_file_open (io_romio_file_open.c:40)
==16715==by 0x1AD52344: module_init (io_base_file_select.c:455)
==16715==by 0x1AD51DFA: mca_io_base_file_select 
(io_base_file_select.c:238)

==16715==by 0x1ACA582F: ompi_file_open (file.c:130)
==16715==by 0x1AD30DA3: PMPI_File_open (pfile_open.c:94)
==16715==by 0x13F9B36F: PAIO::ouvreFichierMPIIO(PAGroupeProcessus&, 
std::string const&, int, ompi_file_t*&, bool) (PAIO.cc:290)
==16715==by 0xCA44252: 
GISLectureEcriture::litGISMPI(std::string, 
GroupeInfoSur&, std::string&) (GISLectureEcriture.icc:411)
==16715==by 0xCA23F0D: Champ::importeParallele(std::string const&) 
(Champ.cc:951)

==16715==by 0x4D0DEE: main (Test.NormesEtProjectionChamp.cc:789)
==16715==  Address 0x32ef3e50 is 0 bytes after a block of size 256 alloc'd
==16715==at 0x4C2C5A4: malloc (vg_replace_malloc.c:296)
==16715==by 0x2FE1C78E: ADIOI_Malloc_fn (malloc.c:50)
==16715==by 0x2FE1C951: ADIOI_Shfp_fname (shfp_fname.c:25)
==16715==by 0x2FDEB493: mca_io_romio_dist_MPI_File_open (open.c:177)
==16715==by 0x2FDE3B0D: mca_io_romio_file_open (io_romio_file_open.c:40)
==16715==by 0x1AD52344: module_init (io_base_file_select.c:455)
==16715==by 0x1AD51DFA: mca_io_base_file_select 
(io_base_file_select.c:238)

==16715==by 0x1ACA582F: ompi_file_open (file.c:130)
==16715==by 0x1AD30DA3: PMPI_File_open (pfile_open.c:94)
==16715==by 0x13F9B36F: PAIO::ouvreFichierMPIIO(PAGroupeProcessus&, 
std::string const&, int, ompi_file_t*&, bool) (PAIO.cc:290)
==16715==by 0xCA44252: 
GISLectureEcriture::litGISMPI(std::string, 
GroupeInfoSur&, std::string&) (GISLectureEcriture.icc:411)
==16715==by 0xCA23F0D: Champ::importeParallele(std::string const&) 
(Champ.cc:951)

==16715==by 0x4D0DEE: main (Test.NormesEtProjectionChamp.cc:789)
...
...
==16715== Invalid write of size 1
==16715==at 0x4C2E7BB: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:915)
==16715==by 0x1F60AA91: opal_convertor_unpack (opal_convertor.c:321)
==16715==by 0x25AA8DD3: mca_pml_ob1_recv_frag_callback_match 
(pml_ob1_recvfrag.c:225)
==16715==by 0x2544110C: mca_btl_vader_check_fboxes 
(btl_vader_fbox.h:220)
==16715==by 0x25443577: mca_btl_vader_component_progress 
(btl_vader_component.c:695)

==16715==by 0x1F5F0F27: opal_progress (opal_progress.c:207)
==16715==by 0x1ACB40B3: opal_condition_wait (condition.h:93)
==16715==by 0x1ACB4201: ompi_request_wait_completion (request.h:381)
==16715==by 0x1ACB4305: ompi_request_default_wait (req_wait.c:39)
==16715==by 0x26BA2FFB: ompi_coll_tuned_bcast_intra_generic 
(coll_tuned_bcast.c:254)
==16715==by 0x26BA36F7: ompi_coll_tuned_bcast_intra_binomial 
(coll_tuned_bcast.c:385)
==16715==by 0x26B94289: ompi_coll_tuned_bcast_intra_dec_fixed 
(coll_tuned_decision_fixed.c:258)

==16715==by 0x1ACD55F2: PMPI_Bcast (pbcast.c:110)
==16715==by 0x2FE1CC48: ADIOI_Shfp_fname (shfp_fname.c:67)
==16715==by 0x2FDEB493: mca_io_romio_dist_MPI_File_open (open.c:177)
==16715==by 0x2FDE3B0D: mca_io_romio_file_open (io_romio_file_open.c:40)
==16715==by 0x1AD52344: module_init (io_base_file_select.c:455)
==16715==