Re: [OMPI users] Segfault when using valgrind

2009-07-09 Thread Justin Luitjens
I was able to get rid of the segfaults/invalid reads by disabling the
shared memory path.  Valgrind still reported an uninitialized-memory error
in the same spot, which I believe is due to the struct being padded for
alignment.  I added a suppression and was able to get past this part just
fine.
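
In case it helps anyone else, here is roughly what that looked like (the
suppression file name and process count are placeholders, and the matched
frame is illustrative -- match whatever frames valgrind actually reports):

=====
# run under valgrind with the shared-memory BTL disabled:
mpirun --mca btl ^sm -np 4 valgrind --suppressions=ompi.supp ./sus

# ompi.supp -- suppress the uninitialised-value report from the padding:
{
   merge-info-padding
   Memcheck:Value8
   fun:mca_btl_sm_component_progress
}
=====

Zeroing the struct before filling it in (e.g.
memset(&myinfo, 0, sizeof(myinfo));) would also quiet the padding
warning at the source.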

Thanks,
Justin

On Thu, Jul 9, 2009 at 5:16 AM, Jeff Squyres  wrote:

> On Jul 7, 2009, at 11:47 AM, Justin wrote:
>
>  (Sorry if this is posted twice, I sent the same email yesterday but it
>> never appeared on the list).
>>
>>
> Sorry for the delay in replying.  FWIW, I got your original message as
> well.
>
>  Hi,  I am attempting to debug a memory corruption in an MPI program
>> using valgrind.  However, when I run with valgrind I get semi-random
>> segfaults and valgrind errors inside the openmpi library.  Here is an
>> example of such a segfault:
>>
>> ==6153==
>> ==6153== Invalid read of size 8
>> ==6153==at 0x19102EA0: (within /usr/lib/openmpi/lib/openmpi/
>> mca_btl_sm.so)
>>
>>  ...
>
>> ==6153==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
>> Thread "main" (pid 6153) caught signal SIGSEGV at address (nil)
>> (segmentation violation)
>>
>> Looking at the code for our isend at SFC.h:2989, I don't see any
>> errors:
>>
>> =
>>  MergeInfo myinfo,theirinfo;
>>
>>  MPI_Request srequest, rrequest;
>>  MPI_Status status;
>>
>>  myinfo.n=n;
>>  if(n!=0)
>>  {
>>myinfo.min=sendbuf[0].bits;
>>myinfo.max=sendbuf[n-1].bits;
>>  }
>>  //cout << rank << " n:" << n << " min:" << (int)myinfo.min << "max:"
>> << (int)myinfo.max << endl;
>>
>>  MPI_Isend(&myinfo,sizeof(MergeInfo),MPI_BYTE,to,0,Comm,&srequest);
>> ==
>>
>> myinfo is a struct located on the stack, to is the rank of the processor
>> that the message is being sent to, and srequest is also on the stack.
>> In addition this message is waited on prior to exiting this block of
>> code so they still exist on the stack.  When I don't run with valgrind
>> my program runs past this point just fine.
>>
>>
> Strange.  I can't think of an immediate reason as to why this would happen
> -- does it also happen if you use a blocking send (vs. an Isend)?  Is myinfo
> a complex object, or a variable-length object?
>
>
>  I am currently using openmpi 1.3 from the Debian unstable branch.  I
>> also see the same type of segfault in a different portion of the code,
>> involving an MPI_Allgatherv, which can be seen below:
>>
>> ==
>> ==22736== Use of uninitialised value of size 8
>> ==22736==at 0x19104775: mca_btl_sm_component_progress
>> (opal_list.h:322)
>> ==22736==by 0x1382CE09: opal_progress (opal_progress.c:207)
>> ==22736==by 0xB404264: ompi_request_default_wait_all (condition.h:99)
>> ==22736==by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual
>> (coll_tuned_util.c:55)
>> ==22736==by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck
>> (coll_tuned_util.h:60)
>> ==22736==by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
>> ==22736==by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
>> ==22736==by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
>> ==22736==by 0x6465457:
>> Uintah::Grid::problemSetup(Uintah::Handle const&,
>> Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
>> ==22736==by 0x8345759: Uintah::SimulationController::gridSetup()
>> (SimulationController.cc:243)
>> ==22736==by 0x834F418: Uintah::AMRSimulationController::run()
>> (AMRSimulationController.cc:117)
>> ==22736==by 0x4089AE: main (sus.cc:629)
>> ==22736==
>> ==22736== Invalid read of size 8
>> ==22736==at 0x19104775: mca_btl_sm_component_progress
>> (opal_list.h:322)
>> ==22736==by 0x1382CE09: opal_progress (opal_progress.c:207)
>> ==22736==by 0xB404264: ompi_request_default_wait_all (condition.h:99)
>> ==22736==by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual
>> (coll_tuned_util.c:55)
>> ==22736==by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck
>> (coll_tuned_util.h:60)
>> ==22736==by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
>> ==22736==by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
>> ==22736==by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
>> ==22736==by 0x6465457:
>> Uintah::Grid::problemSetup(Uintah::Handle const&,
>> Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
>> ==22736==by 0x8345759: Uintah::SimulationController::gridSetup()
>> (SimulationController.cc:243)
>> ==22736==by 0x834F418: Uintah::AMRSimulationController::run()
>> (AMRSimulationController.cc:117)
>> ==22736==by 0x4089AE: main (sus.cc:629)
>> 
>>
>> Are these problems with openmpi, and are there any known workarounds?
>>
>>
>
> These are new to me.  The problem does seem to occur with OMPI's shared
> memory device; you might want to try a different point-to-point device
> (e.g., tcp?) to see if the problem goes away.

Re: [OMPI users] Segfault when using valgrind

2009-07-09 Thread Jeff Squyres

On Jul 7, 2009, at 11:47 AM, Justin wrote:


(Sorry if this is posted twice, I sent the same email yesterday but it
never appeared on the list).



Sorry for the delay in replying.  FWIW, I got your original message as  
well.



Hi,  I am attempting to debug a memory corruption in an MPI program
using valgrind.  However, when I run with valgrind I get semi-random
segfaults and valgrind errors inside the openmpi library.  Here is an
example of such a segfault:

==6153==
==6153== Invalid read of size 8
==6153==at 0x19102EA0: (within /usr/lib/openmpi/lib/openmpi/
mca_btl_sm.so)


...

==6153==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
Thread "main" (pid 6153) caught signal SIGSEGV at address (nil)
(segmentation violation)

Looking at the code for our isend at SFC.h:2989, I don't see any
errors:

=
  MergeInfo myinfo,theirinfo;

  MPI_Request srequest, rrequest;
  MPI_Status status;

  myinfo.n=n;
  if(n!=0)
  {
myinfo.min=sendbuf[0].bits;
myinfo.max=sendbuf[n-1].bits;
  }
  //cout << rank << " n:" << n << " min:" << (int)myinfo.min << "max:"
<< (int)myinfo.max << endl;

  MPI_Isend(&myinfo,sizeof(MergeInfo),MPI_BYTE,to,0,Comm,&srequest);

==

myinfo is a struct located on the stack, to is the rank of the processor
that the message is being sent to, and srequest is also on the stack.
In addition this message is waited on prior to exiting this block of
code so they still exist on the stack.  When I don't run with valgrind
my program runs past this point just fine.



Strange.  I can't think of an immediate reason as to why this would  
happen -- does it also happen if you use a blocking send (vs. an  
Isend)?  Is myinfo a complex object, or a variable-length object?
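
For reference, something like this (same arguments as your snippet,
minus the request) would exercise the blocking path:

=====
/* blocking variant of the same transfer, for comparison: */
MPI_Send(&myinfo, sizeof(MergeInfo), MPI_BYTE, to, 0, Comm);
=====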



I am currently using openmpi 1.3 from the Debian unstable branch.  I
also see the same type of segfault in a different portion of the code,
involving an MPI_Allgatherv, which can be seen below:

==
==22736== Use of uninitialised value of size 8
==22736==at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual
(coll_tuned_util.c:55)
==22736==by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck
(coll_tuned_util.h:60)
==22736==by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==by 0x6465457:
Uintah::Grid::problemSetup(Uintah::Handle const&,
Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==by 0x8345759: Uintah::SimulationController::gridSetup()
(SimulationController.cc:243)
==22736==by 0x834F418: Uintah::AMRSimulationController::run()
(AMRSimulationController.cc:117)
==22736==by 0x4089AE: main (sus.cc:629)
==22736==
==22736== Invalid read of size 8
==22736==at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual
(coll_tuned_util.c:55)
==22736==by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck
(coll_tuned_util.h:60)
==22736==by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==by 0x6465457:
Uintah::Grid::problemSetup(Uintah::Handle const&,
Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==by 0x8345759: Uintah::SimulationController::gridSetup()
(SimulationController.cc:243)
==22736==by 0x834F418: Uintah::AMRSimulationController::run()
(AMRSimulationController.cc:117)
==22736==by 0x4089AE: main (sus.cc:629)


Are these problems with openmpi, and are there any known workarounds?




These are new to me.  The problem does seem to occur with OMPI's  
shared memory device; you might want to try a different point-to-point  
device (e.g., tcp?) to see if the problem goes away.  But be aware  
that the problem "going away" does not really pinpoint the location of  
the problem -- moving to a slower transport (like tcp) may simply  
change timing such that the problem does not occur.  I.e., the problem  
could still exist in either your code or OMPI -- this would simply be  
a workaround.
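
For example, something like the following (process count is a
placeholder):

=====
# restrict OMPI to the tcp and self (loopback) transports:
mpirun --mca btl tcp,self -np 4 valgrind ./sus
=====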


--
Jeff Squyres
Cisco Systems



[OMPI users] Segfault when using valgrind

2009-07-07 Thread Justin
(Sorry if this is posted twice, I sent the same email yesterday but it 
never appeared on the list).



Hi,  I am attempting to debug a memory corruption in an MPI program
using valgrind.  However, when I run with valgrind I get semi-random
segfaults and valgrind errors inside the openmpi library.  Here is an
example of such a segfault:


==6153==
==6153== Invalid read of size 8
==6153==at 0x19102EA0: (within /usr/lib/openmpi/lib/openmpi/
mca_btl_sm.so)
==6153==by 0x182ABACB: (within 
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==6153==by 0x182A3040: (within 
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==6153==by 0xB425DD3: PMPI_Isend (in 
/usr/lib/openmpi/lib/libmpi.so.0.0.0)
==6153==by 0x7B83DA8: int 
Uintah::SFC::MergeExchange(int, 
std::vector >&, 
std::vector >&, 
std::vector >&) (SFC.h:2989)
==6153==by 0x7B84A8F: void Uintah::SFC::Batchers<unsigned char>(std::vector >&, 
std::vector >&, 
std::vector >&) (SFC.h:3730)
==6153==by 0x7B8857B: void Uintah::SFC::Cleanup<unsigned char>(std::vector >&, 
std::vector >&, 
std::vector >&) (SFC.h:3695)
==6153==by 0x7B88CC6: void Uintah::SFC::Parallel0<3, 
unsigned char>() (SFC.h:2928)
==6153==by 0x7C00AAB: void Uintah::SFC::Parallel<3, unsigned 
char>() (SFC.h:1108)
==6153==by 0x7C0EF39: void Uintah::SFC::GenerateDim<3>(int) 
(SFC.h:694)
==6153==by 0x7C0F0F2: Uintah::SFC::GenerateCurve(int) 
(SFC.h:670)
==6153==by 0x7B30CAC: 
Uintah::DynamicLoadBalancer::useSFC(Uintah::Handle 
const&, int*) (DynamicLoadBalancer.cc:429)

==6153==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
Thread "main" (pid 6153) caught signal SIGSEGV at address (nil) 
(segmentation violation)


Looking at the code for our isend at SFC.h:2989, I don't see any
errors:


=
 MergeInfo myinfo,theirinfo;

 MPI_Request srequest, rrequest;
 MPI_Status status;

 myinfo.n=n;
 if(n!=0)
 {
   myinfo.min=sendbuf[0].bits;
   myinfo.max=sendbuf[n-1].bits;
 }
 //cout << rank << " n:" << n << " min:" << (int)myinfo.min << "max:" 
<< (int)myinfo.max << endl;


 MPI_Isend(&myinfo,sizeof(MergeInfo),MPI_BYTE,to,0,Comm,&srequest);
==

myinfo is a struct located on the stack, to is the rank of the processor 
that the message is being sent to, and srequest is also on the stack.  
In addition this message is waited on prior to exiting this block of 
code so they still exist on the stack.  When I don't run with valgrind 
my program runs past this point just fine. 
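
To be concrete, the surrounding pattern is roughly the following
(receive side paraphrased from the same block; the exact calls in
SFC.h may differ slightly):

=====
MPI_Irecv(&theirinfo,sizeof(MergeInfo),MPI_BYTE,to,0,Comm,&rrequest);
MPI_Isend(&myinfo,sizeof(MergeInfo),MPI_BYTE,to,0,Comm,&srequest);
MPI_Wait(&rrequest,&status);  /* both requests complete before  */
MPI_Wait(&srequest,&status);  /* the stack variables go away    */
=====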

I am currently using openmpi 1.3 from the Debian unstable branch.  I 
also see the same type of segfault in a different portion of the code, 
involving an MPI_Allgatherv, which can be seen below:


==
==22736== Use of uninitialised value of size 8
==22736==at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual 
(coll_tuned_util.c:55)
==22736==by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck 
(coll_tuned_util.h:60)

==22736==by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==by 0x6465457: 
Uintah::Grid::problemSetup(Uintah::Handle const&, 
Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==by 0x8345759: Uintah::SimulationController::gridSetup() 
(SimulationController.cc:243)
==22736==by 0x834F418: Uintah::AMRSimulationController::run() 
(AMRSimulationController.cc:117)

==22736==by 0x4089AE: main (sus.cc:629)
==22736==
==22736== Invalid read of size 8
==22736==at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual 
(coll_tuned_util.c:55)
==22736==by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck 
(coll_tuned_util.h:60)

==22736==by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==by 0x6465457: 
Uintah::Grid::problemSetup(Uintah::Handle const&, 
Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==by 0x8345759: Uintah::SimulationController::gridSetup() 
(SimulationController.cc:243)
==22736==by 0x834F418: Uintah::AMRSimulationController::run() 
(AMRSimulationController.cc:117)
==22736==by 0x4089AE: main (sus.cc:629)

[OMPI users] Segfault when using valgrind

2009-07-06 Thread Justin Luitjens
Hi,  I am attempting to debug a memory corruption in an MPI program using
valgrind.  However, when I run with valgrind I get semi-random segfaults and
valgrind errors inside the openmpi library.  Here is an example of such a
segfault:

==6153==
==6153== Invalid read of size 8
==6153==at 0x19102EA0: (within
/usr/lib/openmpi/lib/openmpi/mca_btl_sm.so)
==6153==by 0x182ABACB: (within
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==6153==by 0x182A3040: (within
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==6153==by 0xB425DD3: PMPI_Isend (in
/usr/lib/openmpi/lib/libmpi.so.0.0.0)
==6153==by 0x7B83DA8: int Uintah::SFC::MergeExchange(int, std::vector >&,
std::vector >&,
std::vector >&) (SFC.h:2989)
==6153==by 0x7B84A8F: void Uintah::SFC::Batchers<unsigned char>(std::vector >&,
std::vector >&,
std::vector >&) (SFC.h:3730)
==6153==by 0x7B8857B: void Uintah::SFC::Cleanup<unsigned char>(std::vector >&,
std::vector >&,
std::vector >&) (SFC.h:3695)
==6153==by 0x7B88CC6: void Uintah::SFC::Parallel0<3, unsigned
char>() (SFC.h:2928)
==6153==by 0x7C00AAB: void Uintah::SFC::Parallel<3, unsigned
char>() (SFC.h:1108)
==6153==by 0x7C0EF39: void Uintah::SFC::GenerateDim<3>(int)
(SFC.h:694)
==6153==by 0x7C0F0F2: Uintah::SFC::GenerateCurve(int)
(SFC.h:670)
==6153==by 0x7B30CAC:
Uintah::DynamicLoadBalancer::useSFC(Uintah::Handle const&,
int*) (DynamicLoadBalancer.cc:429)
==6153==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
Thread "main" (pid 6153) caught signal SIGSEGV at address (nil)
(segmentation violation)

Looking at the code for our isend at SFC.h:2989, I don't see any
errors:

=
  MergeInfo myinfo,theirinfo;

  MPI_Request srequest, rrequest;
  MPI_Status status;

  myinfo.n=n;
  if(n!=0)
  {
myinfo.min=sendbuf[0].bits;
myinfo.max=sendbuf[n-1].bits;
  }
  //cout << rank << " n:" << n << " min:" << (int)myinfo.min << "max:" <<
(int)myinfo.max << endl;

  MPI_Isend(&myinfo,sizeof(MergeInfo),MPI_BYTE,to,0,Comm,&srequest);
==

myinfo is a struct located on the stack, to is the rank of the processor
that the message is being sent to, and srequest is also on the stack.  When
I don't run with valgrind my program runs past this point just fine.

I am currently using openmpi 1.3 from the Debian unstable branch.  I also
see the same type of segfault in a different portion of the code, involving
an MPI_Allgatherv, which can be seen below:

==
==22736== Use of uninitialised value of size 8
==22736==at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual
(coll_tuned_util.c:55)
==22736==by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck
(coll_tuned_util.h:60)
==22736==by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==by 0x6465457:
Uintah::Grid::problemSetup(Uintah::Handle const&,
Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==by 0x8345759: Uintah::SimulationController::gridSetup()
(SimulationController.cc:243)
==22736==by 0x834F418: Uintah::AMRSimulationController::run()
(AMRSimulationController.cc:117)
==22736==by 0x4089AE: main (sus.cc:629)
==22736==
==22736== Invalid read of size 8
==22736==at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual
(coll_tuned_util.c:55)
==22736==by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck
(coll_tuned_util.h:60)
==22736==by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==by 0x6465457:
Uintah::Grid::problemSetup(Uintah::Handle const&,
Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==by 0x8345759: Uintah::SimulationController::gridSetup()
(SimulationController.cc:243)
==22736==by 0x834F418: Uintah::AMRSimulationController::run()
(AMRSimulationController.cc:117)
==22736==by 0x4089AE: main (sus.cc:629)