Re: [OMPI users] TotalView Memory debugging and OpenMPI

2011-05-20 Thread Peter Thompson
Thanks Ralph.  I've seen the messages generated in b...@open-mpi.org, so
I figured something was up!  I was going to provide the unified diff,
but then hit another issue in testing where we immediately ran into
a seg fault, even with this fix.  It turns out that prepending
/lib64 (and maybe /usr/lib64) to LD_LIBRARY_PATH works around that one,
so I don't think it's directly related, but it threw me off,
along with the beta testing we're doing...


Cheers,
PeterT


Ralph Castain wrote:

> Okay, I finally had time to parse this and fix it. Thanks!
>
> On May 16, 2011, at 1:02 PM, Peter Thompson wrote:
>
>> Hmmm?  We're not removing the putenv() calls.  Just adding a strdup()
>> beforehand, and then calling putenv() with the string duplicated from
>> env[j].  Of course, if the strdup fails, then we bail out.
>>
>> As for why it's suddenly a problem, I'm not quite as certain.  The problem
>> we do show is a double free, so someone has already freed that memory used
>> by putenv(), and I do know that while that used to be just flagged as an
>> event before, now we seem to be unable to continue past it.  Not sure if
>> that is our change or a library/system change.
>>
>> PeterT




Re: [OMPI users] TotalView Memory debugging and OpenMPI

2011-05-18 Thread Ralph Castain
Okay, I finally had time to parse this and fix it. Thanks!

On May 16, 2011, at 1:02 PM, Peter Thompson wrote:

> Hmmm?  We're not removing the putenv() calls.  Just adding a strdup() 
> beforehand, and then calling putenv() with the string duplicated from env[j]. 
>  Of course, if the strdup fails, then we bail out. 
> As for why it's suddenly a problem, I'm not quite as certain.   The problem 
> we do show is a double free, so someone has already freed that memory used by 
> putenv(), and I do know that while that used to be just flagged as an event 
> before, now we seem to be unable to continue past it.   Not sure if that is 
> our change or a library/system change. 
> PeterT
> 
> 

Re: [OMPI users] TotalView Memory debugging and OpenMPI

2011-05-17 Thread Jeff Squyres
Can you send your diff in unified form?

On May 11, 2011, at 4:05 PM, Peter Thompson wrote:

> We've gotten a few reports of problems with memory debugging when using 
> OpenMPI under TotalView.  Usually, TotalView will attach to the processes 
> started after an MPI_Init.  However in the case where memory debugging is 
> enabled, things seemed to run away or fail.   My analysis showed that we had 
> a number of core files left over from the attempt, and all were mpirun (or 
> orterun) cores.   It seemed to be a regression on our part, since testing 
> seemed to indicate this worked okay before TotalView 8.9.0-0, so I filed an 
> internal bug and passed it to engineering.   After giving our engineer a 
> brief tutorial on how to build a debug version of OpenMPI, he found what 
> appears to be a problem in the code for orterun.c.   He's made a slight 
> change that fixes the issue in 1.4.2, 1.4.3, 1.4.4rc2 and 1.5.3, those being 
> the versions he's tested with so far.  He doesn't subscribe to this list 
> that I know of, so I offered to pass this by the group.   Of course, I'm not 
> sure if this is exactly the right place to submit patches, but I'm sure you'd 
> tell me where to put it if I'm in the wrong here.   It's a short patch, so 
> I'll cut and paste it, and attach as well, since cut and paste can do weird 
> things to formatting.
> 
> Credit goes to Ariel Burton for this patch.  Of course he used TotalView to 
> find this ;-)  It shows up if you do 'mpirun -tv -np 4 ./foo'   or 'totalview 
> mpirun -a -np 4 ./foo'
> 
> Cheers,
> PeterT
> 
> 
> more ~/patches/anbs-patch
> *** orte/tools/orterun/orterun.c  2010-04-13 13:30:34.0 -0400
> --- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c  2011-05-09 20:28:16.588183000 -0400
> ***
> *** 1578,1588 
> }
> if (NULL != env) {
> size1 = opal_argv_count(env);
> for (j = 0; j < size1; ++j) {
> ! putenv(env[j]);
> }
> }
> /* All done */
> --- 1578,1600 
> }
> if (NULL != env) {
> size1 = opal_argv_count(env);
> for (j = 0; j < size1; ++j) {
> ! /* Use-after-Free error possible here.  putenv does not copy
> !the string passed to it, and instead stores only the pointer.
> !env[j] may be freed later, in which case the pointer
> !in environ will now be left dangling into a deallocated
> !region.
> !So we make a copy of the variable.
> ! */
> ! char *s = strdup(env[j]);
> !
> ! if (NULL == s) {
> ! return OPAL_ERR_OUT_OF_RESOURCE;
> ! }
> ! putenv(s);
> }
> }
> /* All done */
> 
> *** orte/tools/orterun/orterun.c  2010-04-13 13:30:34.0 -0400
> --- 
> /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c
>   2011-05-09 20:28:16.588183000 -0400
> ***
> *** 1578,1588 
>  }
> 
>  if (NULL != env) {
>  size1 = opal_argv_count(env);
>  for (j = 0; j < size1; ++j) {
> ! putenv(env[j]);
>  }
>  }
> 
>  /* All done */
> 
> --- 1578,1600 
>  }
> 
>  if (NULL != env) {
>  size1 = opal_argv_count(env);
>  for (j = 0; j < size1; ++j) {
> ! /* Use-after-Free error possible here.  putenv does not copy
> !the string passed to it, and instead stores only the pointer.
> !env[j] may be freed later, in which case the pointer
> !in environ will now be left dangling into a deallocated
> !region.
> !So we make a copy of the variable.
> ! */
> ! char *s = strdup(env[j]);
> ! 
> ! if (NULL == s) {
> ! return OPAL_ERR_OUT_OF_RESOURCE;
> ! }
> ! putenv(s);
>  }
>  }
> 
>  /* All done */
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] TotalView Memory debugging and OpenMPI

2011-05-16 Thread Ralph Castain
Guess I'm having trouble reading your diff... different notation than I'm used 
to seeing. I'll have to parse through it when I have more time.


On May 16, 2011, at 1:02 PM, Peter Thompson wrote:

> Hmmm?  We're not removing the putenv() calls.  Just adding a strdup() 
> beforehand, and then calling putenv() with the string duplicated from env[j]. 
>  Of course, if the strdup fails, then we bail out. 
> As for why it's suddenly a problem, I'm not quite as certain.   The problem 
> we do show is a double free, so someone has already freed that memory used by 
> putenv(), and I do know that while that used to be just flagged as an event 
> before, now we seem to be unable to continue past it.   Not sure if that is 
> our change or a library/system change. 
> PeterT
> 
> 

Re: [OMPI users] TotalView Memory debugging and OpenMPI

2011-05-16 Thread Peter Thompson
Hmmm?  We're not removing the putenv() calls.  Just adding a strdup() 
beforehand, and then calling putenv() with the string duplicated from 
env[j].  Of course, if the strdup fails, then we bail out. 
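
To make that concrete, here is a minimal sketch of the strdup-then-putenv
pattern just described.  It is not the orterun.c code itself:
push_env_copies() is a hypothetical helper, it walks a NULL-terminated
array instead of calling opal_argv_count(), and it returns -1 where the
patch returns OPAL_ERR_OUT_OF_RESOURCE.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not an Open MPI function): push every "NAME=value"
 * entry of a NULL-terminated array into the environment.  putenv() is given
 * a private copy of each entry, so the caller stays free to release env[]
 * afterwards.  Returns 0 on success, -1 on failure. */
static int push_env_copies(char **env)
{
    if (NULL == env) {
        return 0;
    }
    for (size_t j = 0; NULL != env[j]; ++j) {
        char *s = strdup(env[j]);

        if (NULL == s) {
            return -1;              /* out of memory: bail out */
        }
        if (0 != putenv(s)) {
            free(s);                /* putenv failed, so we still own s */
            return -1;
        }
        /* s is deliberately not freed: it now belongs to the environment. */
    }
    return 0;
}
```

The copies are never freed on purpose.  Once handed to putenv() they have
to stay valid for as long as the process may read its environment.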

As for why it's suddenly a problem, I'm not quite as certain.   The 
problem we do show is a double free, so someone has already freed that 
memory used by putenv(), and I do know that while that used to be just 
flagged as an event before, now we seem to be unable to continue past 
it.   Not sure if that is our change or a library/system change. 
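
One way to see why freed memory matters here: putenv() keeps the caller's
pointer rather than copying the string, so the environment entry and the
caller's buffer are the same memory.  The sketch below is a standalone
illustration, not Open MPI code; TV_DEMO_VAR and putenv_aliases_buffer()
are made-up names.  It shows the aliasing safely by editing a static
buffer after the call, and on a POSIX-conforming libc the edit should be
visible through getenv().

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the environment refers to the caller's buffer (the behavior
 * POSIX describes for putenv), 0 otherwise. */
static int putenv_aliases_buffer(void)
{
    /* Static storage: the pointer given to putenv() must remain valid for
     * the rest of the process, which is exactly what a freed env[j] is not. */
    static char entry[] = "TV_DEMO_VAR=one";

    if (0 != putenv(entry)) {
        return 0;
    }

    /* Overwrite the value in place, without calling putenv() again. */
    memcpy(entry + strlen("TV_DEMO_VAR="), "two", 3);

    const char *seen = getenv("TV_DEMO_VAR");
    return NULL != seen && 0 == strcmp(seen, "two");
}
```

If the buffer had come from malloc() and been freed instead of edited, the
environment would be left pointing at deallocated memory, which is the
situation the comment in the patch describes.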


PeterT


Ralph Castain wrote:

> On May 16, 2011, at 12:45 PM, Peter Thompson wrote:
>
>> Hi Ralph,
>>
>> We've had a number of user complaints about this.  Since it seems on the
>> face of it that it is a debugger issue, it may not have made its way back
>> here.  Is your objection that the patch basically aborts if it gets a bad
>> value?  I could understand that being a concern.  Of course, it aborts on
>> TotalView now if we attempt to move forward without this patch.
>
> No - my concern is that you appear to be removing the "putenv" calls. OMPI
> places some values into the local environment so the user can control
> behavior. Removing those causes problems.
>
> What I need to know is why, after it has worked with TV for years, these
> putenv's are suddenly a problem. Is the problem occurring during shutdown?
> Or is this something that causes TV to break?
>
>> I've passed your comment back to the engineer, with a suspicion about the
>> concerns about the abort, but if you have other objections, let me know.
>>
>> Cheers,
>> PeterT



Re: [OMPI users] TotalView Memory debugging and OpenMPI

2011-05-16 Thread Ralph Castain

On May 16, 2011, at 12:45 PM, Peter Thompson wrote:

> Hi Ralph,
> 
> We've had a number of user complaints about this.   Since it seems on the 
> face of it that it is a debugger issue, it may not have made its way back 
> here.  Is your objection that the patch basically aborts if it gets a bad 
> value?   I could understand that being a concern.   Of course, it aborts on 
> TotalView now if we attempt to move forward without this patch.
> 

No - my concern is that you appear to be removing the "putenv" calls. OMPI 
places some values into the local environment so the user can control behavior. 
Removing those causes problems.
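
For comparison only: POSIX setenv() copies the name and value it is given,
so it is another way to push entries into the environment without keeping
a pointer into caller-owned memory.  This is not what the patch under
discussion does, and set_env_entry() below is a hypothetical helper, not
an Open MPI function.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: set one "NAME=value" entry with setenv(), which
 * stores its own copies of the name and the value.  Returns 0 on success,
 * -1 on malformed input or failure. */
static int set_env_entry(const char *entry)
{
    const char *eq = (NULL != entry) ? strchr(entry, '=') : NULL;

    if (NULL == eq || eq == entry) {
        return -1;                  /* no '=' present, or empty name */
    }

    size_t name_len = (size_t)(eq - entry);
    char *name = malloc(name_len + 1);
    if (NULL == name) {
        return -1;
    }
    memcpy(name, entry, name_len);
    name[name_len] = '\0';

    int rc = setenv(name, eq + 1, 1);   /* 1 = overwrite an existing value */
    free(name);                         /* safe: setenv() made its own copy */
    return (0 == rc) ? 0 : -1;
}
```

The trade-off is an extra split of each entry at the '=' sign.  In return,
the caller can free or reuse its strings as soon as the call returns.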

What I need to know is why, after it has worked with TV for years, these 
putenv's are suddenly a problem. Is the problem occurring during shutdown? Or 
is this something that causes TV to break?


> I've passed your comment back to the engineer, with a suspicion about the 
> concerns about the abort, but if you have other objections, let me know.
> 
> Cheers,
> PeterT
> 
> 

Re: [OMPI users] TotalView Memory debugging and OpenMPI

2011-05-16 Thread Peter Thompson

Hi Ralph,

We've had a number of user complaints about this.   Since it seems on 
the face of it that it is a debugger issue, it may not have made its 
way back here.  Is your objection that the patch basically aborts if it 
gets a bad value?   I could understand that being a concern.   Of 
course, it aborts on TotalView now if we attempt to move forward without 
this patch.


I've passed your comment back to the engineer, with a suspicion about 
the concerns about the abort, but if you have other objections, let me know.


Cheers,
PeterT


Ralph Castain wrote:

That would be a problem, I fear. We need to push those envars into the 
environment.

Is there some particular problem causing what you see? We have no other reports 
of this issue, and orterun has had that code forever.



Sent from my iPad







Re: [OMPI users] TotalView Memory debugging and OpenMPI

2011-05-11 Thread Ralph Castain
That would be a problem, I fear. We need to push those envars into the 
environment.

Is there some particular problem causing what you see? We have no other reports 
of this issue, and orterun has had that code forever.



Sent from my iPad

On May 11, 2011, at 2:05 PM, Peter Thompson  
wrote:

> We've gotten a few reports of problems with memory debugging when using 
> OpenMPI under TotalView.  Usually, TotalView will attach to the processes 
> started after an MPI_Init.  However in the case where memory debugging is 
> enabled, things seemed to run away or fail.   My analysis showed that we had 
> a number of core files left over from the attempt, and all were mpirun (or 
> orterun) cores.   It seemed to be a regression on our part, since testing 
> seemed to indicate this worked okay before TotalView 8.9.0-0, so I filed an 
> internal bug and passed it to engineering.   After giving our engineer a 
> brief tutorial on how to build a debug version of OpenMPI, he found what 
> appears to be a problem in the code for orterun.c.   He's made a slight 
> change that fixes the issue in 1.4.2, 1.4.3, 1.4.4rc2 and 1.5.3, those being 
> the versions he's tested with so far.  He doesn't subscribe to this list 
> that I know of, so I offered to pass this by the group.   Of course, I'm not 
> sure if this is exactly the right place to submit patches, but I'm sure you'd 
> tell me where to put it if I'm in the wrong here.   It's a short patch, so 
> I'll cut and paste it, and attach as well, since cut and paste can do weird 
> things to formatting.
> 
> Credit goes to Ariel Burton for this patch.  Of course he used TotalView to 
> find this ;-)  It shows up if you do 'mpirun -tv -np 4 ./foo'   or 'totalview 
> mpirun -a -np 4 ./foo'
> 
> Cheers,
> PeterT
> 
> 
> more ~/patches/anbs-patch
> *** orte/tools/orterun/orterun.c 2010-04-13 13:30:34.0 -0400
> --- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c 2011-05-09 20:28:16.588183000 -0400
> ***
> *** 1578,1588 
> }
> if (NULL != env) {
> size1 = opal_argv_count(env);
> for (j = 0; j < size1; ++j) {
> ! putenv(env[j]);
> }
> }
> /* All done */
> --- 1578,1600 
> }
> if (NULL != env) {
> size1 = opal_argv_count(env);
> for (j = 0; j < size1; ++j) {
> ! /* Use-after-Free error possible here.  putenv does not copy
> !the string passed to it, and instead stores only the pointer.
> !env[j] may be freed later, in which case the pointer
> !in environ will now be left dangling into a deallocated
> !region.
> !So we make a copy of the variable.
> ! */
> ! char *s = strdup(env[j]);
> !
> ! if (NULL == s) {
> ! return OPAL_ERR_OUT_OF_RESOURCE;
> ! }
> ! putenv(s);
> }
> }
> /* All done */
> 
> *** orte/tools/orterun/orterun.c 2010-04-13 13:30:34.0 -0400
> --- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c 2011-05-09 20:28:16.588183000 -0400
> ***
> *** 1578,1588 
>  }
> 
>  if (NULL != env) {
>  size1 = opal_argv_count(env);
>  for (j = 0; j < size1; ++j) {
> ! putenv(env[j]);
>  }
>  }
> 
>  /* All done */
> 
> --- 1578,1600 
>  }
> 
>  if (NULL != env) {
>  size1 = opal_argv_count(env);
>  for (j = 0; j < size1; ++j) {
> ! /* Use-after-Free error possible here.  putenv does not copy
> !the string passed to it, and instead stores only the pointer.
> !env[j] may be freed later, in which case the pointer
> !in environ will now be left dangling into a deallocated
> !region.
> !So we make a copy of the variable.
> ! */
> ! char *s = strdup(env[j]);
> ! 
> ! if (NULL == s) {
> ! return OPAL_ERR_OUT_OF_RESOURCE;
> ! }
> ! putenv(s);
>  }
>  }
> 
>  /* All done */
> 



[OMPI users] TotalView Memory debugging and OpenMPI

2011-05-11 Thread Peter Thompson
We've gotten a few reports of problems with memory debugging when using 
OpenMPI under TotalView.  Usually, TotalView will attach to the 
processes started after an MPI_Init.  However in the case where memory 
debugging is enabled, things seemed to run away or fail.   My analysis 
showed that we had a number of core files left over from the attempt, 
and all were mpirun (or orterun) cores.   It seemed to be a regression 
on our part, since testing seemed to indicate this worked okay before 
TotalView 8.9.0-0, so I filed an internal bug and passed it to 
engineering.   After giving our engineer a brief tutorial on how to 
build a debug version of OpenMPI, he found what appears to be a problem 
in the code for orterun.c.   He's made a slight change that fixes the 
issue in 1.4.2, 1.4.3, 1.4.4rc2 and 1.5.3, those being the versions he's 
tested with so far.  He doesn't subscribe to this list that I know of, 
so I offered to pass this by the group.   Of course, I'm not sure if 
this is exactly the right place to submit patches, but I'm sure you'd 
tell me where to put it if I'm in the wrong here.   It's a short patch, 
so I'll cut and paste it, and attach as well, since cut and paste can do 
weird things to formatting.


Credit goes to Ariel Burton for this patch.  Of course he used TotalView 
to find this ;-)  It shows up if you do 'mpirun -tv -np 4 ./foo'   or 
'totalview mpirun -a -np 4 ./foo'


Cheers,
PeterT


more ~/patches/anbs-patch
*** orte/tools/orterun/orterun.c 2010-04-13 13:30:34.0 -0400
--- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c 2011-05-09 20:28:16.588183000 -0400
***
*** 1578,1588 
 }

 if (NULL != env) {
 size1 = opal_argv_count(env);
 for (j = 0; j < size1; ++j) {
! putenv(env[j]);
 }
 }

 /* All done */

--- 1578,1600 
 }

 if (NULL != env) {
 size1 = opal_argv_count(env);
 for (j = 0; j < size1; ++j) {
! /* Use-after-Free error possible here.  putenv does not copy
!the string passed to it, and instead stores only the pointer.
!env[j] may be freed later, in which case the pointer
!in environ will now be left dangling into a deallocated
!region.
!So we make a copy of the variable.
! */
! char *s = strdup(env[j]);
!
! if (NULL == s) {
! return OPAL_ERR_OUT_OF_RESOURCE;
! }
! putenv(s);
 }
 }

 /* All done */


*** orte/tools/orterun/orterun.c 2010-04-13 13:30:34.0 -0400
--- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c 2011-05-09 20:28:16.588183000 -0400
***
*** 1578,1588 
  }

  if (NULL != env) {
  size1 = opal_argv_count(env);
  for (j = 0; j < size1; ++j) {
! putenv(env[j]);
  }
  }

  /* All done */

--- 1578,1600 
  }

  if (NULL != env) {
  size1 = opal_argv_count(env);
  for (j = 0; j < size1; ++j) {
! /* Use-after-Free error possible here.  putenv does not copy
!the string passed to it, and instead stores only the pointer.
!env[j] may be freed later, in which case the pointer
!in environ will now be left dangling into a deallocated
!region.
!So we make a copy of the variable.
! */
! char *s = strdup(env[j]);
! 
! if (NULL == s) {
! return OPAL_ERR_OUT_OF_RESOURCE;
! }
! putenv(s);
  }
  }

  /* All done */
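
A possible alternative to the strdup() approach, offered only as a sketch and not as what the patch does: POSIX setenv() copies both the name and the value, so the caller's "NAME=VALUE" string can be freed afterwards without leaving a dangling pointer in environ. The helper name setenv_from_pair and the DEMO_SETENV variable in the usage note are hypothetical, and whether setenv() is acceptable on every platform OMPI supports is an assumption that has not been checked here.

```c
#define _XOPEN_SOURCE 700   /* expose setenv() under strict -std modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Splits a "NAME=VALUE" string and calls setenv(), which makes its own
 * copies of both parts.  Returns 0 on success, -1 on malformed input
 * (no '=' or an empty NAME) or on allocation/setenv failure. */
int setenv_from_pair(const char *pair)
{
    const char *eq;
    char *name;
    size_t n;
    int rc;

    assert(NULL != pair);

    eq = strchr(pair, '=');
    if (NULL == eq || eq == pair) {
        return -1;
    }

    /* Temporary NUL-terminated copy of the NAME part only. */
    n = (size_t)(eq - pair);
    name = malloc(n + 1);
    if (NULL == name) {
        return -1;
    }
    memcpy(name, pair, n);
    name[n] = '\0';

    rc = setenv(name, eq + 1, 1);   /* 1 = overwrite an existing value */
    free(name);                     /* safe: setenv() kept its own copy */
    return (0 == rc) ? 0 : -1;
}
```

With this helper, something like setenv_from_pair("DEMO_SETENV=value") leaves the environment independent of the caller's buffer. The trade-off against the patch is an extra split and temporary allocation per variable in exchange for not having to leave the duplicated strings allocated on purpose.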