On Jul 7, 2009, at 11:47 AM, Justin wrote:

(Sorry if this is posted twice; I sent the same email yesterday, but it
never appeared on the list.)


Sorry for the delay in replying. FWIW, I got your original message as well.

Hi, I am attempting to debug memory corruption in an MPI program
using Valgrind.  However, when I run under Valgrind I get semi-random
segfaults and Valgrind errors from within the Open MPI library.  Here is
an example of such a segfault:

==6153==
==6153== Invalid read of size 8
==6153==    at 0x19102EA0: (within /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so)

...
==6153==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
Thread "main" (pid 6153) caught signal SIGSEGV at address (nil)
(segmentation violation)
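
For reference, I am launching each rank under Valgrind roughly like this
(the process count and arguments here are just placeholders):

  mpirun -np 4 valgrind ./sus <input file>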

Looking at the code, our Isend at SFC.h:298 does not seem to have any
errors:

=============================================
  MergeInfo<BITS> myinfo, theirinfo;

  MPI_Request srequest, rrequest;
  MPI_Status status;

  myinfo.n = n;
  if (n != 0)
  {
    myinfo.min = sendbuf[0].bits;
    myinfo.max = sendbuf[n-1].bits;
  }
  //cout << rank << " n:" << n << " min:" << (int)myinfo.min << "max:" << (int)myinfo.max << endl;

  MPI_Isend(&myinfo, sizeof(MergeInfo<BITS>), MPI_BYTE, to, 0, Comm, &srequest);
==============================================

myinfo is a struct located on the stack, to is the rank of the processor
the message is being sent to, and srequest is also on the stack.
In addition, this message is waited on prior to exiting this block of
code, so these variables still exist on the stack until the send
completes.  When I don't run under Valgrind, my program runs past this
point just fine.
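
In case it helps, the overall shape of that block is roughly the following
(paraphrased from memory, not copy-pasted; "from" is the rank we receive
from, and the waits may differ slightly in the real code):

=============================================
  // exchange merge info with our partner ranks
  MPI_Isend(&myinfo,    sizeof(MergeInfo<BITS>), MPI_BYTE, to,   0, Comm, &srequest);
  MPI_Irecv(&theirinfo, sizeof(MergeInfo<BITS>), MPI_BYTE, from, 0, Comm, &rrequest);

  // both requests complete before myinfo, theirinfo, srequest, and
  // rrequest go out of scope
  MPI_Wait(&srequest, &status);
  MPI_Wait(&rrequest, &status);
=============================================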


Strange. I can't think of an immediate reason as to why this would happen -- does it also happen if you use a blocking send (vs. an Isend)? Is myinfo a complex object, or a variable-length object?
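
If it's easy to try, it might also be worth temporarily replacing the nonblocking exchange with a single blocking call, just as a data point. Using the names from your snippet (and assuming a matching receive from some rank "from"), that could look something like:

  MPI_Sendrecv(&myinfo,    sizeof(MergeInfo<BITS>), MPI_BYTE, to,   0,
               &theirinfo, sizeof(MergeInfo<BITS>), MPI_BYTE, from, 0,
               Comm, &status);

If the Valgrind complaints go away with the blocking version, that would at least tell us something about where to look.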

I am currently using Open MPI 1.3 from the Debian unstable branch.  I
also see the same type of segfault in a different portion of the code
involving an MPI_Allgatherv, which can be seen below:

==============================================
==22736== Use of uninitialised value of size 8
==22736==    at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==    by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==    by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==    by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
==22736==    by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck (coll_tuned_util.h:60)
==22736==    by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==    by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==    by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==    by 0x6465457: Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&, Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==    by 0x8345759: Uintah::SimulationController::gridSetup() (SimulationController.cc:243)
==22736==    by 0x834F418: Uintah::AMRSimulationController::run() (AMRSimulationController.cc:117)
==22736==    by 0x4089AE: main (sus.cc:629)
==22736==
==22736== Invalid read of size 8
==22736==    at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==    by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==    by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==    by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
==22736==    by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck (coll_tuned_util.h:60)
==22736==    by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==    by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==    by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==    by 0x6465457: Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&, Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==    by 0x8345759: Uintah::SimulationController::gridSetup() (SimulationController.cc:243)
==22736==    by 0x834F418: Uintah::AMRSimulationController::run() (AMRSimulationController.cc:117)
==22736==    by 0x4089AE: main (sus.cc:629)
================================================================

Are these problems with Open MPI, and are there any known workarounds?



These are new to me. The problem does seem to occur with OMPI's shared memory device; you might want to try a different point-to-point device (e.g., tcp?) to see if the problem goes away. But be aware that the problem "going away" does not really pinpoint the location of the problem -- moving to a slower transport (like tcp) may simply change timing such that the problem does not occur. I.e., the problem could still exist in either your code or OMPI -- this would simply be a workaround.
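
For example, to take the sm BTL out of the picture and force TCP (plus the self loopback component), something along these lines should work (adjust the process count and program/arguments for your setup):

  mpirun --mca btl tcp,self -np 4 valgrind ./sus <args>

Equivalently, you can exclude just the shared memory component with "--mca btl ^sm".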

--
Jeff Squyres
Cisco Systems
