Might be worth trying 1.8.3 to see if the problem goes away - there is an 
updated version of ROMIO in it.
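
If it is not obvious which Open MPI the test actually picks up, a quick
sanity check is to have the test print the version macros from mpi.h at
startup.  A minimal sketch (OMPI_MAJOR_VERSION and friends are Open
MPI-specific, and this only reflects the headers used at compile time; the
runtime library can still differ if the dynamic linker finds another one):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* Version of the mpi.h this binary was compiled against
       (Open MPI-specific macros). */
    printf("built against Open MPI %d.%d.%d\n",
           OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION);
    MPI_Finalize();
    return 0;
}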

> On Nov 25, 2014, at 12:13 PM, Eric Chamberland 
> <eric.chamberl...@giref.ulaval.ca> wrote:
> 
> Hi,
> 
> I get random segmentation violations (signal 11) in 
> mca_io_romio_dist_MPI_File_close (see the stack traces below) when testing 
> MPI I/O calls with 2 processes on a single machine.  Most of the time 
> (1499 runs out of 1500), it works perfectly.
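> 
> For reference, the sequence that ends in MPI_File_close boils down to the
> usual open / write / close pattern.  A stripped-down sketch of that
> pattern (not the actual test code; the file name and payload below are
> placeholders) looks like this:
> 
> #include <mpi.h>
> 
> int main(int argc, char **argv)
> {
>     MPI_File fh;
>     int rank;
>     char buf[4] = { 'a', 'b', 'c', 'd' };          /* placeholder payload */
> 
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>     MPI_File_open(MPI_COMM_WORLD, "testfile.bin",  /* placeholder name */
>                   MPI_MODE_CREATE | MPI_MODE_WRONLY,
>                   MPI_INFO_NULL, &fh);
>     MPI_File_write_at(fh, (MPI_Offset)rank * sizeof(buf),
>                       buf, sizeof(buf), MPI_CHAR, MPI_STATUS_IGNORE);
> 
>     /* The crashes reported below happen inside this call, in ROMIO's
>        close path (close.c). */
>     MPI_File_close(&fh);
> 
>     MPI_Finalize();
>     return 0;
> }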
> 
> Here are the call stacks (for 1.6.3) on both processes:
> ====================
> process 0:
> ====================
> #0  0x00000035374cf287 in sched_yield () from /lib64/libc.so.6
> #1  0x00007ff73d158f4f in opal_progress () at runtime/opal_progress.c:220
> #2  0x00007ff73d0a6fc5 in opal_condition_wait (count=2, 
> requests=0x7fffe3ef7ca0, statuses=0x7fffe3ef7c70) at 
> ../opal/threads/condition.h:99
> #3  ompi_request_default_wait_all (count=2, requests=0x7fffe3ef7ca0, 
> statuses=0x7fffe3ef7c70) at request/req_wait.c:263
> #4  0x00007ff7348d365e in ompi_coll_tuned_sendrecv_actual (sendbuf=0x0, 
> scount=0, sdatatype=0x7ff73d3c0cc0, dest=1, stag=-16, recvbuf=<value 
> optimized out>, rcount=0, rdatatype=0x7ff73d3c0cc0, source=1,
>    rtag=-16, comm=0x5c21a50, status=0x0) at coll_tuned_util.c:54
> #5  0x00007ff7348db8ff in ompi_coll_tuned_barrier_intra_two_procs 
> (comm=<value optimized out>, module=<value optimized out>) at 
> coll_tuned_barrier.c:256
> #6  0x00007ff73d0b42d2 in PMPI_Barrier (comm=0x5c21a50) at pbarrier.c:70
> #7  0x00007ff7302a549c in mca_io_romio_dist_MPI_File_close (mpi_fh=0x47d9e70) 
> at close.c:62
> #8  0x00007ff73d0a15fe in file_destructor (file=0x4d7b270) at file/file.c:273
> #9  0x00007ff73d0a1519 in opal_obj_run_destructors (file=0x7fffe3ef8bb0) at 
> ../opal/class/opal_object.h:448
> #10 ompi_file_close (file=0x7fffe3ef8bb0) at file/file.c:146
> #11 0x00007ff73d0ce868 in PMPI_File_close (fh=0x7fffe3ef8bb0) at 
> pfile_close.c:59
> 
> ====================
> process 1:
> ====================
> ...
> #9  <signal handler called>
> #10 0x00000035374784fd in _int_free () from /lib64/libc.so.6
> #11 0x00007f37d777e493 in mca_io_romio_dist_MPI_File_close (mpi_fh=0x4d41c90) 
> at close.c:55
> #12 0x00007f37e457a5fe in file_destructor (file=0x4dbc9b0) at file/file.c:273
> #13 0x00007f37e457a519 in opal_obj_run_destructors (file=0x7fff7c2c94b0) at 
> ../opal/class/opal_object.h:448
> #14 ompi_file_close (file=0x7fff7c2c94b0) at file/file.c:146
> #15 0x00007f37e45a7868 in PMPI_File_close (fh=0x7fff7c2c94b0) at 
> pfile_close.c:59
> ...
> 
> The problematic free is:
> 
> 55              ADIOI_Free((fh)->shared_fp_fname);
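> 
> (Note that process 0 is already past this line when the dumps below were
> taken: it is waiting in the barrier at close.c:62, so the garbage-looking
> shared_fp_fname in its dump is most likely just the contents of an
> already-freed block.  Process 1 is the one that dies inside the free
> itself.)
> 
> Since the crash happens inside free(), one way to get the abort closer to
> whatever corrupts the heap is to enable glibc's mcheck() block checking
> before any allocation is made.  A minimal, glibc-specific sketch (mcheck()
> must be called before the first malloc to take effect):
> 
> #include <mcheck.h>
> #include <mpi.h>
> 
> int main(int argc, char **argv)
> {
>     /* Ask glibc to verify each block on malloc()/free(); passing NULL
>        keeps the default print-and-abort behaviour.  Must run before
>        anything allocates, otherwise mcheck() returns non-zero. */
>     if (mcheck(NULL) != 0)
>         return 1;
> 
>     MPI_Init(&argc, &argv);
>     /* ... run the MPI I/O test here ... */
>     MPI_Finalize();
>     return 0;
> }
> 
> (Linking the test with -lmcheck, or setting MALLOC_CHECK_=3 in the
> environment, gives similar checking without touching the code.)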
> 
> Here are the values in the "fh" structure on both processes:
> 
> ====================
> process 0:
> ====================
> {cookie = 2487376, fd_sys = 12, fd_direct = -1, direct_read = 53, 
> direct_write = 1697919538, d_mem = 3158059, d_miniosz = 1702127872, fp_ind = 
> 11, fp_sys_posn = -1, fns = 0x7ff7304b2280, comm = 0x5c21a50,
>  agg_comm = 0x7ff73d3d4120, is_open = 1, is_agg = 1,
>  filename = 0x4d103a0 
> "/pmi/cmpbib/compilation_BIB_gcc_redhat_lance_validation/COMPILE_AUTO/TestValidation/Ressources/dev/Test.NormesEtProjectionChamp/Ressources.champscalhermite2dordre5incarete_elemtri_2proc/Resultats.Etal"...,
>  file_system = 152, access_mode = 2, disp = 0, etype = 0x7ff73d3c0cc0, 
> filetype = 0x7ff73d3c0cc0, etype_size = 1, hints = 0x4cffde0, info = 
> 0x5377610, split_coll_count = 0, split_status = {
>    MPI_SOURCE = 1681024372, MPI_TAG = 1919185519, MPI_ERROR = 1852388709, 
> _cancelled = 1701994851, _ucount = 8389473197092726132}, split_datatype = 
> 0x636f7270325f6972,
>  shared_fp_fname = 0x4d01810 "\330\376x75", shared_fp_fd = 0x0, async_count = 
> 0, perm = -1, atomicity = 0, fortran_handle = -1, err_handler = 
> 0x7ff73d3d55c0, fs_ptr = 0x0, file_realm_st_offs = 0x0,
>  file_realm_types = 0x0, my_cb_nodes_index = 0}
> 
> 
> ====================
> process 1:
> ====================
> print *fh
> $4 = {cookie = 2487376, fd_sys = 12, fd_direct = -1, direct_read = 0, 
> direct_write = 1697919538, d_mem = 3158059, d_miniosz = 1702127872, fp_ind = 
> 11, fp_sys_posn = -1, fns = 0x7f37d798b280, comm = 0x4db8060,
>  agg_comm = 0x7f37e48ad120, is_open = 1, is_agg = 0,
>  filename = 0x4d52b30 
> "/pmi/cmpbib/compilation_BIB_gcc_redhat_lance_validation/COMPILE_AUTO/TestValidation/Ressources/dev/Test.NormesEtProjectionChamp/Ressources.champscalhermite2dordre5incarete_elemtri_2proc/Resultats.Etal"...,
>  file_system = 152, access_mode = 2, disp = 0, etype = 0x7f37e4899cc0, 
> filetype = 0x7f37e4899cc0, etype_size = 1, hints = 0x45c5250, info = 
> 0x4d46750, split_coll_count = 0, split_status = {
>    MPI_SOURCE = 1681024372, MPI_TAG = 1919185519, MPI_ERROR = 1852388709, 
> _cancelled = 1701994851, _ucount = 168}, split_datatype = 0x7f37e489b0c0,
>  shared_fp_fname = 0x4806e20 
> "/pmi/cmpbib/compilation_BIB_gcc_redhat_lance_validation/COMPILE_AUTO/TestValidation/Ressources/dev/Test.NormesEtProjectionChamp/Ressources.champscalhermite2dordre5incarete_elemtri_2proc/Resultats.Etal"...,
>  shared_fp_fd = 0x0, async_count = 0, perm = -1, atomicity = 0, 
> fortran_handle = -1, err_handler = 0x7f37e48ae5c0, fs_ptr = 0x0, 
> file_realm_st_offs = 0x0, file_realm_types = 0x0,
>  my_cb_nodes_index = -1}
> 
> 
> With OpenMPI 1.6.5, the problem also occurs, although only a small number of times.
> 
> Here is the error, reported by glibc on process 1:
> 
> *** Error in `/home/mefpp_ericc/GIREF/bin/Test.NormesEtProjectionChamp.dev': 
> free(): invalid next size (normal): 0x000000000471cbc0 ***
> ======= Backtrace: =========
> /lib64/libc.so.6(+0x7afc6)[0x7f1082edffc6]
> /lib64/libc.so.6(+0x7bd43)[0x7f1082ee0d43]
> /opt/openmpi-1.6.5/lib64/libmpi.so.1(+0x630a1)[0x7f10847260a1]
> /opt/openmpi-1.6.5/lib64/libmpi.so.1(ompi_info_free+0x41)[0x7f10847264f1]
> /opt/openmpi-1.6.5/lib64/libmpi.so.1(PMPI_Info_free+0x47)[0x7f108473fd17]
> /opt/openmpi-1.6.5/lib64/openmpi/mca_io_romio.so(ADIO_Close+0x186)[0x7f107665f666]
> /opt/openmpi-1.6.5/lib64/openmpi/mca_io_romio.so(mca_io_romio_dist_MPI_File_close+0xf3)[0x7f107667fde3]
> /opt/openmpi-1.6.5/lib64/libmpi.so.1(+0x60856)[0x7f1084723856]
> /opt/openmpi-1.6.5/lib64/libmpi.so.1(ompi_file_close+0x41)[0x7f1084723d71]
> /opt/openmpi-1.6.5/lib64/libmpi.so.1(PMPI_File_close+0x78)[0x7f1084750588]
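> 
> For what it is worth, that particular glibc message usually means some
> earlier code wrote past the end of a neighbouring allocation, and the
> free() that aborts is only the victim.  A tiny standalone demo of the
> mechanism (completely unrelated to ROMIO, just to illustrate the message):
> 
> #include <stdlib.h>
> #include <string.h>
> 
> int main(void)
> {
>     char *a = malloc(16);
>     char *b = malloc(16);          /* neighbouring block */
> 
>     /* Write past the end of 'a' into the allocator metadata of 'b'. */
>     memset(a, 0x41, 32);
> 
>     /* glibc typically aborts here with "free(): invalid next size",
>        blaming the free although the real bug is the memset above. */
>     free(a);
>     free(b);
>     return 0;
> }
> 
> So the corruption reported above may well originate somewhere before
> MPI_File_close is ever reached.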
> 
> What could be wrong?  Has this been fixed or changed in newer releases of OpenMPI?
> 
> Thanks,
> 
> Eric
