Hello all,
I am the main developer of the Scotch parallel graph partitioning package, which uses both MPI and POSIX Pthreads. I have been doing a great deal of testing of my program on various platforms and libraries, searching for potential bugs (there may still be some ;-) ). The new memchecker tool proposed in v1.3 is indeed interesting to me, so I tried it on my Linux platform. I used the following configuration options:

% ./configure --enable-debug --enable-mem-debug --enable-memchecker --with-valgrind=/usr/bin --enable-mpi-threads --prefix=/home/pastix/pelegrin/openmpi

% ompi_info
                Package: Open MPI pelegrin@laurel Distribution
               Open MPI: 1.3b2
  Open MPI SVN revision: r19927
  Open MPI release date: Nov 04, 2008
               Open RTE: 1.3b2
  Open RTE SVN revision: r19927
  Open RTE release date: Nov 04, 2008
                   OPAL: 1.3b2
      OPAL SVN revision: r19927
      OPAL release date: Nov 04, 2008
           Ident string: 1.3b2
                 Prefix: /home/pastix/pelegrin/openmpi
Configured architecture: x86_64-unknown-linux-gnu
         Configure host: laurel
          Configured by: pelegrin
          Configured on: Wed Nov 19 00:50:50 CET 2008
         Configure host: laurel
               Built by: pelegrin
               Built on: mercredi 19 novembre 2008, 00:55:59 (UTC+0100)
             Built host: laurel
             C bindings: yes
           C++ bindings: yes
     Fortran77 bindings: yes (all)
     Fortran90 bindings: yes
Fortran90 bindings size: small
             C compiler: gcc
    C compiler absolute: /usr/bin/gcc
           C++ compiler: g++
  C++ compiler absolute: /usr/bin/g++
     Fortran77 compiler: gfortran
 Fortran77 compiler abs: /usr/bin/gfortran
     Fortran90 compiler: gfortran
 Fortran90 compiler abs: /usr/bin/gfortran
            C profiling: yes
          C++ profiling: yes
    Fortran77 profiling: yes
    Fortran90 profiling: yes
         C++ exceptions: no
         Thread support: posix (mpi: yes, progress: no)
          Sparse Groups: no
 Internal debug support: yes
    MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: yes
        libltdl support: yes
  Heterogeneous support: no
mpirun default --prefix: no
        MPI I/O support: yes
      MPI_WTIME support: gettimeofday
Symbol visibility support: yes
   FT Checkpoint support: no (checkpoint thread: no)
          MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3)
         MCA memchecker: valgrind (MCA v2.0, API v2.0, Component v1.3)
             MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3)
          MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3)
              MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3)
              MCA carto: file (MCA v2.0, API v2.0, Component v1.3)
          MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3)
          MCA maffinity: libnuma (MCA v2.0, API v2.0, Component v1.3)
              MCA timer: linux (MCA v2.0, API v2.0, Component v1.3)
        MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3)
        MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3)
                MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3)
             MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3)
          MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3)
          MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3)
               MCA coll: basic (MCA v2.0, API v2.0, Component v1.3)
               MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3)
               MCA coll: inter (MCA v2.0, API v2.0, Component v1.3)
               MCA coll: self (MCA v2.0, API v2.0, Component v1.3)
               MCA coll: sm (MCA v2.0, API v2.0, Component v1.3)
               MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3)
                 MCA io: romio (MCA v2.0, API v2.0, Component v1.3)
              MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3)
              MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3)
                MCA pml: cm (MCA v2.0, API v2.0, Component v1.3)
                MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3)
                MCA pml: v (MCA v2.0, API v2.0, Component v1.3)
                MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3)
             MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3)
                MCA btl: self (MCA v2.0, API v2.0, Component v1.3)
                MCA btl: sm (MCA v2.0, API v2.0, Component v1.3)
                MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3)
               MCA topo: unity (MCA v2.0, API v2.0, Component v1.3)
                MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3)
                MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3)
                MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3)
                MCA iof: orted (MCA v2.0, API v2.0, Component v1.3)
                MCA iof: tool (MCA v2.0, API v2.0, Component v1.3)
                MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3)
               MCA odls: default (MCA v2.0, API v2.0, Component v1.3)
                MCA ras: slurm (MCA v2.0, API v2.0, Component v1.3)
              MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.3)
              MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.3)
              MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.3)
                MCA rml: oob (MCA v2.0, API v2.0, Component v1.3)
             MCA routed: binomial (MCA v2.0, API v2.0, Component v1.3)
             MCA routed: direct (MCA v2.0, API v2.0, Component v1.3)
             MCA routed: linear (MCA v2.0, API v2.0, Component v1.3)
                MCA plm: rsh (MCA v2.0, API v2.0, Component v1.3)
                MCA plm: slurm (MCA v2.0, API v2.0, Component v1.3)
              MCA filem: rsh (MCA v2.0, API v2.0, Component v1.3)
             MCA errmgr: default (MCA v2.0, API v2.0, Component v1.3)
                MCA ess: env (MCA v2.0, API v2.0, Component v1.3)
                MCA ess: hnp (MCA v2.0, API v2.0, Component v1.3)
                MCA ess: singleton (MCA v2.0, API v2.0, Component v1.3)
                MCA ess: slurm (MCA v2.0, API v2.0, Component v1.3)
                MCA ess: tool (MCA v2.0, API v2.0, Component v1.3)
            MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.3)
            MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.3)

% gcc --version
gcc (Debian 4.3.2-1) 4.3.2
Copyright (C) 2008 Free Software Foundation, Inc.

I launched my program under valgrind on two procs and got the following report:

% mpirun -np 2 valgrind ../bin/dgord ~/paral/graph/altr4.grf.gz /dev/null -vt
==10978== Memcheck, a memory error detector.
==10978== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==10978== Using LibVEX rev 1854, a library for dynamic binary translation.
==10978== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==10978== Using valgrind-3.3.1-Debian, a dynamic binary instrumentation framework.
==10978== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==10978== For more details, rerun with: -v
==10978==
==10979== Memcheck, a memory error detector.
==10979== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==10979== Using LibVEX rev 1854, a library for dynamic binary translation.
==10979== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==10979== Using valgrind-3.3.1-Debian, a dynamic binary instrumentation framework.
==10979== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==10979== For more details, rerun with: -v
==10979==
==10979== Syscall param sched_setaffinity(mask) points to unaddressable byte(s)
==10978== Syscall param sched_setaffinity(mask) points to unaddressable byte(s)
==10978==    at 0x65FB269: syscall (in /lib/libc-2.7.so)
==10978==    by 0x6C8365A: opal_paffinity_linux_plpa_api_probe_init (plpa_api_probe.c:43)
==10978==    by 0x6C83BB8: opal_paffinity_linux_plpa_init (plpa_runtime.c:36)
==10978==    by 0x6C84984: opal_paffinity_linux_plpa_have_topology_information (plpa_map.c:501)
==10978==    by 0x6C83129: linux_module_init (paffinity_linux_module.c:119)
==10978==    by 0x5AB35EA: opal_paffinity_base_select (paffinity_base_select.c:64)
==10978==    by 0x5A7DE99: opal_init (opal_init.c:292)
==10978==    by 0x580087A: orte_init (orte_init.c:76)
==10978==    by 0x551675F: ompi_mpi_init (ompi_mpi_init.c:343)
==10978==    by 0x55546B9: PMPI_Init_thread (pinit_thread.c:82)
==10978==    by 0x4067CF: main (dgord.c:123)
==10978==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==10979==    at 0x65FB269: syscall (in /lib/libc-2.7.so)
==10979==    by 0x6C8365A: opal_paffinity_linux_plpa_api_probe_init (plpa_api_probe.c:43)
==10979==    by 0x6C83BB8: opal_paffinity_linux_plpa_init (plpa_runtime.c:36)
==10979==    by 0x6C84984: opal_paffinity_linux_plpa_have_topology_information (plpa_map.c:501)
==10979==    by 0x6C83129: linux_module_init (paffinity_linux_module.c:119)
==10979==    by 0x5AB35EA: opal_paffinity_base_select (paffinity_base_select.c:64)
==10979==    by 0x5A7DE99: opal_init (opal_init.c:292)
==10979==    by 0x580087A: orte_init (orte_init.c:76)
==10979==    by 0x551675F: ompi_mpi_init (ompi_mpi_init.c:343)
==10979==    by 0x55546B9: PMPI_Init_thread (pinit_thread.c:82)
==10979==    by 0x4067CF: main (dgord.c:123)
==10979==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==10978== Warning: set address range perms: large range 134217728 (defined)
==10979== Warning: set address range perms: large range 134217728 (defined)
==10978==
==10978== Conditional jump or move depends on uninitialised value(s)
==10978==    at 0x88E8F1A: mca_mpool_sm_alloc (mpool_sm_module.c:80)
==10978==    by 0x972EBD4: mpool_calloc (btl_sm.c:109)
==10978==    by 0x972F6A8: sm_btl_first_time_init (btl_sm.c:314)
==10978==    by 0x972FCE3: mca_btl_sm_add_procs (btl_sm.c:488)
==10978==    by 0x9322C33: mca_bml_r2_add_procs (bml_r2.c:206)
==10978==    by 0x8EFD000: mca_pml_ob1_add_procs (pml_ob1.c:237)
==10978==    by 0x5516F4C: ompi_mpi_init (ompi_mpi_init.c:678)
==10978==    by 0x55546B9: PMPI_Init_thread (pinit_thread.c:82)
==10978==    by 0x4067CF: main (dgord.c:123)
==10978==
==10978== Conditional jump or move depends on uninitialised value(s)
==10978==    at 0x88E8F1A: mca_mpool_sm_alloc (mpool_sm_module.c:80)
==10978==    by 0x972EBD4: mpool_calloc (btl_sm.c:109)
==10978==    by 0x972EC85: init_fifos (btl_sm.c:125)
==10978==    by 0x972F6CB: sm_btl_first_time_init (btl_sm.c:317)
==10978==    by 0x972FCE3: mca_btl_sm_add_procs (btl_sm.c:488)
==10978==    by 0x9322C33: mca_bml_r2_add_procs (bml_r2.c:206)
==10978==    by 0x8EFD000: mca_pml_ob1_add_procs (pml_ob1.c:237)
==10978==    by 0x5516F4C: ompi_mpi_init (ompi_mpi_init.c:678)
==10978==    by 0x55546B9: PMPI_Init_thread (pinit_thread.c:82)
==10978==    by 0x4067CF: main (dgord.c:123)
==10978==
==10978== Conditional jump or move depends on uninitialised value(s)
==10978==    at 0x88E8F1A: mca_mpool_sm_alloc (mpool_sm_module.c:80)
==10978==    by 0x54E660E: ompi_free_list_grow (ompi_free_list.c:198)
==10978==    by 0x54E6435: ompi_free_list_init_ex_new (ompi_free_list.c:163)
==10978==    by 0x972F9D3: ompi_free_list_init_new (ompi_free_list.h:169)
==10978==    by 0x972F864: sm_btl_first_time_init (btl_sm.c:343)
==10978==    by 0x972FCE3: mca_btl_sm_add_procs (btl_sm.c:488)
==10978==    by 0x9322C33: mca_bml_r2_add_procs (bml_r2.c:206)
==10978==    by 0x8EFD000: mca_pml_ob1_add_procs (pml_ob1.c:237)
==10978==    by 0x5516F4C: ompi_mpi_init (ompi_mpi_init.c:678)
==10978==    by 0x55546B9: PMPI_Init_thread (pinit_thread.c:82)
==10978==    by 0x4067CF: main (dgord.c:123)
==10979==
==10979== Conditional jump or move depends on uninitialised value(s)
==10979==    at 0x88E8F1A: mca_mpool_sm_alloc (mpool_sm_module.c:80)
==10979==    by 0x972EBD4: mpool_calloc (btl_sm.c:109)
==10979==    by 0x972F6A8: sm_btl_first_time_init (btl_sm.c:314)
==10979==    by 0x972FCE3: mca_btl_sm_add_procs (btl_sm.c:488)
==10979==    by 0x9322C33: mca_bml_r2_add_procs (bml_r2.c:206)
==10979==    by 0x8EFD000: mca_pml_ob1_add_procs (pml_ob1.c:237)
==10979==    by 0x5516F4C: ompi_mpi_init (ompi_mpi_init.c:678)
==10979==    by 0x55546B9: PMPI_Init_thread (pinit_thread.c:82)
==10979==    by 0x4067CF: main (dgord.c:123)
==10979==
==10979== Conditional jump or move depends on uninitialised value(s)
==10979==    at 0x88E8F1A: mca_mpool_sm_alloc (mpool_sm_module.c:80)
==10979==    by 0x972EBD4: mpool_calloc (btl_sm.c:109)
==10979==    by 0x972EC85: init_fifos (btl_sm.c:125)
==10979==    by 0x972F6CB: sm_btl_first_time_init (btl_sm.c:317)
==10979==    by 0x972FCE3: mca_btl_sm_add_procs (btl_sm.c:488)
==10979==    by 0x9322C33: mca_bml_r2_add_procs (bml_r2.c:206)
==10979==    by 0x8EFD000: mca_pml_ob1_add_procs (pml_ob1.c:237)
==10979==    by 0x5516F4C: ompi_mpi_init (ompi_mpi_init.c:678)
==10979==    by 0x55546B9: PMPI_Init_thread (pinit_thread.c:82)
==10979==    by 0x4067CF: main (dgord.c:123)
==10979==
==10979== Conditional jump or move depends on uninitialised value(s)
==10979==    at 0x88E8F1A: mca_mpool_sm_alloc (mpool_sm_module.c:80)
==10979==    by 0x54E660E: ompi_free_list_grow (ompi_free_list.c:198)
==10979==    by 0x54E6435: ompi_free_list_init_ex_new (ompi_free_list.c:163)
==10979==    by 0x972F9D3: ompi_free_list_init_new (ompi_free_list.h:169)
==10979==    by 0x972F864: sm_btl_first_time_init (btl_sm.c:343)
==10979==    by 0x972FCE3: mca_btl_sm_add_procs (btl_sm.c:488)
==10979==    by 0x9322C33: mca_bml_r2_add_procs (bml_r2.c:206)
==10979==    by 0x8EFD000: mca_pml_ob1_add_procs (pml_ob1.c:237)
==10979==    by 0x5516F4C: ompi_mpi_init (ompi_mpi_init.c:678)
==10979==    by 0x55546B9: PMPI_Init_thread (pinit_thread.c:82)
==10979==    by 0x4067CF: main (dgord.c:123)
==10979==
==10979== Conditional jump or move depends on uninitialised value(s)
==10979==    at 0x88E8F1A: mca_mpool_sm_alloc (mpool_sm_module.c:80)
==10979==    by 0x9730165: ompi_fifo_init (ompi_fifo.h:280)
==10979==    by 0x9730044: mca_btl_sm_add_procs (btl_sm.c:538)
==10979==    by 0x9322C33: mca_bml_r2_add_procs (bml_r2.c:206)
==10979==    by 0x8EFD000: mca_pml_ob1_add_procs (pml_ob1.c:237)
==10979==    by 0x5516F4C: ompi_mpi_init (ompi_mpi_init.c:678)
==10979==    by 0x55546B9: PMPI_Init_thread (pinit_thread.c:82)
==10979==    by 0x4067CF: main (dgord.c:123)
==10979==
==10979== Conditional jump or move depends on uninitialised value(s)
==10979==    at 0x88E8F1A: mca_mpool_sm_alloc (mpool_sm_module.c:80)
==10979==    by 0x97302C4: ompi_cb_fifo_init (ompi_circular_buffer_fifo.h:158)
==10979==    by 0x97301BA: ompi_fifo_init (ompi_fifo.h:288)
==10979==    by 0x9730044: mca_btl_sm_add_procs (btl_sm.c:538)
==10979==    by 0x9322C33: mca_bml_r2_add_procs (bml_r2.c:206)
==10979==    by 0x8EFD000: mca_pml_ob1_add_procs (pml_ob1.c:237)
==10979==    by 0x5516F4C: ompi_mpi_init (ompi_mpi_init.c:678)
==10979==    by 0x55546B9: PMPI_Init_thread (pinit_thread.c:82)
==10979==    by 0x4067CF: main (dgord.c:123)
==10979==
==10979== Conditional jump or move depends on uninitialised value(s)
==10979==    at 0x88E8F1A: mca_mpool_sm_alloc (mpool_sm_module.c:80)
==10979==    by 0x97303B3: ompi_cb_fifo_init (ompi_circular_buffer_fifo.h:180)
==10979==    by 0x97301BA: ompi_fifo_init (ompi_fifo.h:288)
==10979==    by 0x9730044: mca_btl_sm_add_procs (btl_sm.c:538)
==10979==    by 0x9322C33: mca_bml_r2_add_procs (bml_r2.c:206)
==10979==    by 0x8EFD000: mca_pml_ob1_add_procs (pml_ob1.c:237)
==10979==    by 0x5516F4C: ompi_mpi_init (ompi_mpi_init.c:678)
==10979==    by 0x55546B9: PMPI_Init_thread (pinit_thread.c:82)
==10979==    by 0x4067CF: main (dgord.c:123)
==10978==
==10978== Conditional jump or move depends on uninitialised value(s)
==10978==    at 0x88E8F1A: mca_mpool_sm_alloc (mpool_sm_module.c:80)
==10978==    by 0x9730165: ompi_fifo_init (ompi_fifo.h:280)
==10978==    by 0x9730044: mca_btl_sm_add_procs (btl_sm.c:538)
==10978==    by 0x9322C33: mca_bml_r2_add_procs (bml_r2.c:206)
==10978==    by 0x8EFD000: mca_pml_ob1_add_procs (pml_ob1.c:237)
==10978==    by 0x5516F4C: ompi_mpi_init (ompi_mpi_init.c:678)
==10978==    by 0x55546B9: PMPI_Init_thread (pinit_thread.c:82)
==10978==    by 0x4067CF: main (dgord.c:123)
==10978==
==10978== Conditional jump or move depends on uninitialised value(s)
==10978==    at 0x88E8F1A: mca_mpool_sm_alloc (mpool_sm_module.c:80)
==10978==    by 0x97302C4: ompi_cb_fifo_init (ompi_circular_buffer_fifo.h:158)
==10978==    by 0x97301BA: ompi_fifo_init (ompi_fifo.h:288)
==10978==    by 0x9730044: mca_btl_sm_add_procs (btl_sm.c:538)
==10978==    by 0x9322C33: mca_bml_r2_add_procs (bml_r2.c:206)
==10978==    by 0x8EFD000: mca_pml_ob1_add_procs (pml_ob1.c:237)
==10978==    by 0x5516F4C: ompi_mpi_init (ompi_mpi_init.c:678)
==10978==    by 0x55546B9: PMPI_Init_thread (pinit_thread.c:82)
==10978==    by 0x4067CF: main (dgord.c:123)
==10978==
==10978== Conditional jump or move depends on uninitialised value(s)
==10978==    at 0x88E8F1A: mca_mpool_sm_alloc (mpool_sm_module.c:80)
==10978==    by 0x97303B3: ompi_cb_fifo_init (ompi_circular_buffer_fifo.h:180)
==10978==    by 0x97301BA: ompi_fifo_init (ompi_fifo.h:288)
==10978==    by 0x9730044: mca_btl_sm_add_procs (btl_sm.c:538)
==10978==    by 0x9322C33: mca_bml_r2_add_procs (bml_r2.c:206)
==10978==    by 0x8EFD000: mca_pml_ob1_add_procs (pml_ob1.c:237)
==10978==    by 0x5516F4C: ompi_mpi_init (ompi_mpi_init.c:678)
==10978==    by 0x55546B9: PMPI_Init_thread (pinit_thread.c:82)
==10978==    by 0x4067CF: main (dgord.c:123)
==10979==
==10979== Uninitialised byte(s) found during client check request
==10979==    at 0x5AB2C87: valgrind_module_isdefined (memchecker_valgrind_module.c:112)
==10979==    by 0x5AB26CB: opal_memchecker_base_isdefined (memchecker_base_wrappers.c:34)
==10979==    by 0x553F067: memchecker_call (memchecker.h:96)
==10979==    by 0x553ECDF: PMPI_Bcast (pbcast.c:41)
==10979==    by 0x40BB82: _SCOTCHdgraphLoad (dgraph_io_load.c:226)
==10979==    by 0x406B32: main (dgord.c:265)
==10979==  Address 0x7feffff74 is on thread 1's stack
==10978==
==10978== Invalid read of size 8
==10978==    at 0x8F0D85A: memchecker_call (memchecker.h:94)
==10978==    by 0x8F0D812: mca_pml_ob1_send_request_free (pml_ob1_sendreq.c:107)
==10978==    by 0x55154DA: ompi_request_free (request.h:354)
==10978==    by 0x5515B05: ompi_request_default_wait_all (req_wait.c:344)
==10978==    by 0x556EA00: PMPI_Waitall (pwaitall.c:68)
==10978==    by 0x41FC2D: _SCOTCHdgraphCoarsen (dgraph_coarsen.c:711)
==10978==    by 0x415B34: vdgraphSeparateMl2 (vdgraph_separate_ml.c:99)
==10979==
==10979== Invalid read of size 8
==10979==    at 0x8F0D85A: memchecker_call (memchecker.h:94)
==10979==    by 0x8F0D812: mca_pml_ob1_send_request_free (pml_ob1_sendreq.c:107)
==10979==    by 0x55154DA: ompi_request_free (request.h:354)
==10979==    by 0x5515B05: ompi_request_default_wait_all (req_wait.c:344)
==10979==    by 0x556EA00: PMPI_Waitall (pwaitall.c:68)
==10979==    by 0x41FC2D: _SCOTCHdgraphCoarsen (dgraph_coarsen.c:711)
==10979==    by 0x415B34: vdgraphSeparateMl2 (vdgraph_separate_ml.c:99)
==10979==    by 0x415D96: _SCOTCHvdgraphSeparateMl (vdgraph_separate_ml.c:660)
==10979==    by 0x40EC49: _SCOTCHvdgraphSeparateSt (vdgraph_separate_st.c:327)
==10979==    by 0x412F52: _SCOTCHhdgraphOrderNd (hdgraph_order_nd.c:294)
==10979==    by 0x40EB26: _SCOTCHhdgraphOrderSt (hdgraph_order_st.c:216)
==10979==    by 0x40734A: SCOTCH_dgraphOrderComputeList (library_dgraph_order.c:220)
==10979==  Address 0x28 is not stack'd, malloc'd or (recently) free'd
==10978==    by 0x415D96: _SCOTCHvdgraphSeparateMl (vdgraph_separate_ml.c:660)
==10978==    by 0x40EC49: _SCOTCHvdgraphSeparateSt (vdgraph_separate_st.c:327)
==10978==    by 0x412F52: _SCOTCHhdgraphOrderNd (hdgraph_order_nd.c:294)
==10978==    by 0x40EB26: _SCOTCHhdgraphOrderSt (hdgraph_order_st.c:216)
==10978==    by 0x40734A: SCOTCH_dgraphOrderComputeList (library_dgraph_order.c:220)
==10978==  Address 0x28 is not stack'd, malloc'd or (recently) free'd
[laurel:10979] *** Process received signal ***
[laurel:10978] *** Process received signal ***
[laurel:10979] Signal: Segmentation fault (11)
[laurel:10979] Signal code: Address not mapped (1)
[laurel:10979] Failing at address: 0x28
[laurel:10978] Signal: Segmentation fault (11)
[laurel:10978] Signal code: Address not mapped (1)
[laurel:10978] Failing at address: 0x28
[laurel:10979] [ 0] /lib/libpthread.so.0 [0x6321a80]
[laurel:10979] [ 1] /home/pastix/pelegrin/openmpi/lib/openmpi/mca_pml_ob1.so [0x8f0d85a]
[laurel:10979] [ 2] /home/pastix/pelegrin/openmpi/lib/openmpi/mca_pml_ob1.so [0x8f0d813]
[laurel:10979] [ 3] /home/pastix/pelegrin/openmpi/lib/libmpi.so.0 [0x55154db]
[laurel:10979] [ 4] /home/pastix/pelegrin/openmpi/lib/libmpi.so.0 [0x5515b06]
[laurel:10979] [ 5] /home/pastix/pelegrin/openmpi/lib/libmpi.so.0(PMPI_Waitall+0x15d) [0x556ea01]
[laurel:10979] [ 6] ../bin/dgord(_SCOTCHdgraphCoarsen+0x13ce) [0x41fc2e]
[laurel:10979] [ 7] ../bin/dgord [0x415b35]
[laurel:10979] [ 8] ../bin/dgord(_SCOTCHvdgraphSeparateMl+0x27) [0x415d97]
[laurel:10979] [ 9] ../bin/dgord(_SCOTCHvdgraphSeparateSt+0x5a) [0x40ec4a]
[laurel:10978] [ 0] /lib/libpthread.so.0 [0x6321a80]
[laurel:10979] [10] ../bin/dgord(_SCOTCHhdgraphOrderNd+0xe3) [0x412f53]
[laurel:10979] [11] ../bin/dgord(_SCOTCHhdgraphOrderSt+0x67) [0x40eb27]
[laurel:10978] [ 1] /home/pastix/pelegrin/openmpi/lib/openmpi/mca_pml_ob1.so [0x8f0d85a]
[laurel:10979] [12] ../bin/dgord(SCOTCH_dgraphOrderComputeList+0xeb) [0x40734b]
[laurel:10979] [13] ../bin/dgord(main+0x3ec) [0x406b7c]
[laurel:10979] [14] /lib/libc.so.6(__libc_start_main+0xe6) [0x654d1a6]
[laurel:10979] [15] ../bin/dgord [0x406669]
[laurel:10978] [ 2] /home/pastix/pelegrin/openmpi/lib/openmpi/mca_pml_ob1.so [0x8f0d813]
[laurel:10978] [ 3] /home/pastix/pelegrin/openmpi/lib/libmpi.so.0 [0x55154db]
[laurel:10978] [ 4] /home/pastix/pelegrin/openmpi/lib/libmpi.so.0 [0x5515b06]
[laurel:10978] [ 5] /home/pastix/pelegrin/openmpi/lib/libmpi.so.0(PMPI_Waitall+0x15d) [0x556ea01]
[laurel:10978] [ 6] ../bin/dgord(_SCOTCHdgraphCoarsen+0x13ce) [0x41fc2e]
[laurel:10978] [ 7] ../bin/dgord [0x415b35]
[laurel:10978] [ 8] ../bin/dgord(_SCOTCHvdgraphSeparateMl+0x27) [0x415d97]
[laurel:10978] [ 9] ../bin/dgord(_SCOTCHvdgraphSeparateSt+0x5a) [0x40ec4a]
[laurel:10979] *** End of error message ***
[laurel:10978] [10] ../bin/dgord(_SCOTCHhdgraphOrderNd+0xe3) [0x412f53]
[laurel:10978] [11] ../bin/dgord(_SCOTCHhdgraphOrderSt+0x67) [0x40eb27]
[laurel:10978] [12] ../bin/dgord(SCOTCH_dgraphOrderComputeList+0xeb) [0x40734b]
[laurel:10978] [13] ../bin/dgord(main+0x3ec) [0x406b7c]
[laurel:10978] [14] /lib/libc.so.6(__libc_start_main+0xe6) [0x654d1a6]
[laurel:10978] [15] ../bin/dgord [0x406669]
==10979==
[laurel:10978] *** End of error message ***
==10979== Process terminating with default action of signal 11 (SIGSEGV)
==10979==  Access not within mapped region at address 0x29
==10979==    at 0x8F0D85A: memchecker_call (memchecker.h:94)
==10978==
==10978== Process terminating with default action of signal 11 (SIGSEGV)
==10978==  Access not within mapped region at address 0x29
==10978==    at 0x8F0D85A: memchecker_call (memchecker.h:94)
==10978==    by 0x8F0D812: mca_pml_ob1_send_request_free (pml_ob1_sendreq.c:107)
==10978==    by 0x55154DA: ompi_request_free (request.h:354)
==10978==    by 0x5515B05: ompi_request_default_wait_all (req_wait.c:344)
==10978==    by 0x556EA00: PMPI_Waitall (pwaitall.c:68)
==10978==    by 0x41FC2D: _SCOTCHdgraphCoarsen (dgraph_coarsen.c:711)
==10978==    by 0x415B34: vdgraphSeparateMl2 (vdgraph_separate_ml.c:99)
==10978==    by 0x415D96: _SCOTCHvdgraphSeparateMl (vdgraph_separate_ml.c:660)
==10978==    by 0x40EC49: _SCOTCHvdgraphSeparateSt (vdgraph_separate_st.c:327)
==10978==    by 0x412F52: _SCOTCHhdgraphOrderNd (hdgraph_order_nd.c:294)
==10978==    by 0x40EB26: _SCOTCHhdgraphOrderSt (hdgraph_order_st.c:216)
==10978==    by 0x40734A: SCOTCH_dgraphOrderComputeList (library_dgraph_order.c:220)
==10979==    by 0x8F0D812: mca_pml_ob1_send_request_free (pml_ob1_sendreq.c:107)
==10979==    by 0x55154DA: ompi_request_free (request.h:354)
==10979==    by 0x5515B05: ompi_request_default_wait_all (req_wait.c:344)
==10979==    by 0x556EA00: PMPI_Waitall (pwaitall.c:68)
==10979==    by 0x41FC2D: _SCOTCHdgraphCoarsen (dgraph_coarsen.c:711)
==10979==    by 0x415B34: vdgraphSeparateMl2 (vdgraph_separate_ml.c:99)
==10979==    by 0x415D96: _SCOTCHvdgraphSeparateMl (vdgraph_separate_ml.c:660)
==10979==    by 0x40EC49: _SCOTCHvdgraphSeparateSt (vdgraph_separate_st.c:327)
==10979==    by 0x412F52: _SCOTCHhdgraphOrderNd (hdgraph_order_nd.c:294)
==10979==    by 0x40EB26: _SCOTCHhdgraphOrderSt (hdgraph_order_st.c:216)
==10979==    by 0x40734A: SCOTCH_dgraphOrderComputeList (library_dgraph_order.c:220)
==10979==
==10979== ERROR SUMMARY: 14 errors from 9 contexts (suppressed: 264 from 3)
==10979== malloc/free: in use at exit: 4,626,295 bytes in 2,614 blocks.
==10978==
==10978== ERROR SUMMARY: 13 errors from 8 contexts (suppressed: 264 from 3)
==10979== malloc/free: 9,296 allocs, 6,682 frees, 11,121,335 bytes allocated.
==10979== For counts of detected errors, rerun with: -v
==10978== malloc/free: in use at exit: 4,671,068 bytes in 2,627 blocks.
==10978== malloc/free: 9,315 allocs, 6,688 frees, 13,108,494 bytes allocated.
==10978== For counts of detected errors, rerun with: -v
==10979== searching for pointers to 2,614 not-freed blocks.
==10978== searching for pointers to 2,627 not-freed blocks.
==10978== checked 138,090,848 bytes.
==10978==
==10979== checked 138,047,136 bytes.
==10979==
==10979== LEAK SUMMARY:
==10979==    definitely lost: 2,049 bytes in 25 blocks.
==10979==      possibly lost: 2,405,098 bytes in 60 blocks.
==10979==    still reachable: 2,219,148 bytes in 2,529 blocks.
==10979==         suppressed: 0 bytes in 0 blocks.
==10979== Rerun with --leak-check=full to see details of leaked memory.
==10978== LEAK SUMMARY:
==10978==    definitely lost: 2,125 bytes in 27 blocks.
==10978==      possibly lost: 2,445,353 bytes in 63 blocks.
==10978==    still reachable: 2,223,590 bytes in 2,537 blocks.
==10978==         suppressed: 0 bytes in 0 blocks.
==10978== Rerun with --leak-check=full to see details of leaked memory.
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 10979 on node laurel exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I want to report the following issues:

1) The "Conditional jump or move depends on uninitialised value(s)" messages are quite puzzling. Do they correspond to real problems in Open MPI, or should they just be ignored?

2) The MPI_Waitall call which triggers the crash spans a set of former receive requests that have already been set to MPI_REQUEST_NULL, plus a set of matching (and hence already matched) Isend requests.

3) Memchecker also complains (wrongfully, I think) in the case of a Bcast where the receivers have not pre-initialized all of their receive array. I guess the sender and receiver sides should get different treatment during memchecking, since only one data array is passed, which is either read or written depending on the root process number.

4) It also complains when two Isend's refer to overlapping regions of the same memory area. It seems that the first Isend marks the region as "non-readable", while it should just be marked "non-writeable", shouldn't it?

5) Keep up the good work! Congrats. ;-)

Sincerely yours,

f.p.
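PS: To make issue 2 easier to reproduce, here is a minimal standalone sketch of the request pattern in question. This is my own illustration, not code extracted from Scotch; all variable names are made up. It posts a matched Irecv/Isend pair, completes the receive (which sets its handle to MPI_REQUEST_NULL), then calls MPI_Waitall over the whole array. Per the MPI standard, Waitall must silently skip null requests, so this should be valid. Run with exactly two processes, e.g. `mpirun -np 2 valgrind ./a.out`.

```c
#include <mpi.h>
#include <assert.h>

int main (int argc, char *argv[])
{
  int         rank, peer, sbuf = 42, rbuf = 0;
  MPI_Request req[2];

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  peer = rank ^ 1;                        /* Assumes exactly two processes   */

  MPI_Irecv (&rbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &req[0]);
  MPI_Isend (&sbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &req[1]);

  MPI_Wait (&req[0], MPI_STATUS_IGNORE);  /* Completed receive: req[0] is    */
                                          /* now MPI_REQUEST_NULL            */

  /* Waitall over a null receive request plus a matched Isend request:
   * legal MPI, and the shape of the call that crashed under memchecker.     */
  MPI_Waitall (2, req, MPI_STATUSES_IGNORE);
  assert (rbuf == 42);
  assert (req[0] == MPI_REQUEST_NULL);

  MPI_Finalize ();
  return 0;
}
```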
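PS: A minimal sketch of the Bcast case of issue 3 (again my own illustration, not the actual _SCOTCHdgraphLoad code; names and sizes are arbitrary). On non-root ranks the buffer is pure output, so handing MPI_Bcast an uninitialized array should be legitimate there; only on the root is the buffer an input. Run with two or more processes under valgrind to reproduce the "Uninitialised byte(s) found during client check request" complaint.

```c
#include <mpi.h>
#include <assert.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
  int   rank, i;
  int * buf;

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);

  buf = malloc (16 * sizeof (int));       /* Deliberately left uninitialized */
  if (rank == 0)                          /* Only the root fills the buffer  */
    for (i = 0; i < 16; i ++)
      buf[i] = i;

  /* Non-root ranks pass uninitialized memory that will only be written to;
   * memchecker nevertheless checks the buffer for definedness on all ranks. */
  MPI_Bcast (buf, 16, MPI_INT, 0, MPI_COMM_WORLD);
  assert (buf[15] == 15);                 /* Defined everywhere afterwards   */

  free (buf);
  MPI_Finalize ();
  return 0;
}
```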