On May 16, 2011, at 12:45 PM, Peter Thompson wrote:

> Hi Ralph,
> 
> We've had a number of user complaints about this.   Since it seems on the 
> face of it to be a debugger issue, it may not have made its way back 
> here.  Is your objection that the patch basically aborts if it gets a bad 
> value?   I could understand that being a concern.   Of course, it aborts on 
> TotalView now if we attempt to move forward without this patch.
> 

No - my concern is that you appear to be removing the "putenv" calls. OMPI 
places some values into the local environment so the user can control behavior. 
Removing those causes problems.

What I need to know is why, after it has worked with TV for years, these 
putenv's are suddenly a problem. Is the problem occurring during shutdown? Or 
is this something that causes TV to break?
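
For reference, here is a minimal sketch of the putenv semantics under 
discussion. It is not taken from orterun.c; the function names and the 
DEMO_* variable names are made up for illustration. It only shows that 
putenv() stores the caller's pointer in the environment, while setenv() 
makes its own copy:

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if putenv() aliased our buffer, 0 if it
 * did not, -1 on error. POSIX says the string passed to putenv() becomes
 * part of the environment, so altering it alters the environment. */
int putenv_aliases_buffer(void)
{
    /* static so the buffer outlives the environment entry that points at it */
    static char buf[] = "DEMO_PUTENV_VAR=before";
    const char *v;

    if (putenv(buf) != 0) {
        return -1;
    }
    /* Overwrite the value part in place ("before" and "after!" are both 6 chars). */
    memcpy(buf + strlen("DEMO_PUTENV_VAR="), "after!", 6);

    v = getenv("DEMO_PUTENV_VAR");
    return (v != NULL && strcmp(v, "after!") == 0) ? 1 : 0;
}

/* Hypothetical helper: returns 1 if setenv() copied the value, 0 if it
 * did not, -1 on error. */
int setenv_copies_value(void)
{
    char value[] = "before";
    const char *v;

    if (setenv("DEMO_SETENV_VAR", value, 1) != 0) {
        return -1;
    }
    /* Changing our local buffer must not affect the environment copy. */
    memcpy(value, "after!", 6);

    v = getenv("DEMO_SETENV_VAR");
    return (v != NULL && strcmp(v, "before") == 0) ? 1 : 0;
}
```

Calling both helpers from a small main() should show each returning 1 on 
a POSIX system. If the buffer handed to putenv() were heap memory that is 
later freed, environ would be left holding a dangling pointer, which is 
the situation the comment in the proposed patch describes. Note that the 
patch's strdup'd copy is deliberately never freed, since the environment 
keeps pointing at it.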


> I've passed your comment back to the engineer, with a suspicion about the 
> concerns about the abort, but if you have other objections, let me know.
> 
> Cheers,
> PeterT
> 
> 
> Ralph Castain wrote:
>> That would be a problem, I fear. We need to push those envars into the 
>> environment.
>> 
>> Is there some particular problem causing what you see? We have no other 
>> reports of this issue, and orterun has had that code forever.
>> 
>> 
>> 
>> Sent from my iPad
>> 
>> On May 11, 2011, at 2:05 PM, Peter Thompson <peter.thomp...@roguewave.com> 
>> wrote:
>> 
>>  
>>> We've gotten a few reports of problems with memory debugging when using 
>>> OpenMPI under TotalView.  Usually, TotalView will attach to the processes 
>>> started after an MPI_Init.  However in the case where memory debugging is 
>>> enabled, things seemed to run away or fail.   My analysis showed that we 
>>> had a number of core files left over from the attempt, and all were mpirun 
>>> (or orterun) cores.   It seemed to be a regression on our part, since 
>>> testing seemed to indicate this worked okay before TotalView 8.9.0-0, so I 
>>> filed an internal bug and passed it to engineering.   After giving our 
>>> engineer a brief tutorial on how to build a debug version of OpenMPI, he 
>>> found what appears to be a problem in the code for orterun.c.   He's made a 
>>> slight change that fixes the issue in 1.4.2, 1.4.3, 1.4.4rc2 and 1.5.3, 
>>> those being the versions he's tested with so far.    He doesn't subscribe 
>>> to this list that I know of, so I offered to pass this by the group.   Of 
>>> course, I'm not sure if this is exactly the right place to submit patches, 
>>> but I'm sure you'd tell me where to put it if I'm in the wrong here.   It's 
>>> a short patch, so I'll cut and paste it, and attach as well, since cut and 
>>> paste can do weird things to formatting.
>>> 
>>> Credit goes to Ariel Burton for this patch.  Of course he used TotalView to 
>>> find this ;-)  It shows up if you do 'mpirun -tv -np 4 ./foo'   or 
>>> 'totalview mpirun -a -np 4 ./foo'
>>> 
>>> Cheers,
>>> PeterT
>>> 
>>> 
>>> more ~/patches/anbs-patch
>>> *** orte/tools/orterun/orterun.c        2010-04-13 13:30:34.000000000 -0400
>>> --- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c        2011-05-09 20:28:16.588183000 -0400
>>> ***************
>>> *** 1578,1588 ****
>>>    }
>>>    if (NULL != env) {
>>>        size1 = opal_argv_count(env);
>>>        for (j = 0; j < size1; ++j) {
>>> !             putenv(env[j]);
>>>        }
>>>    }
>>>    /* All done */
>>> --- 1578,1600 ----
>>>    }
>>>    if (NULL != env) {
>>>        size1 = opal_argv_count(env);
>>>        for (j = 0; j < size1; ++j) {
>>> !             /* Use-after-Free error possible here.  putenv does not copy
>>> !                the string passed to it, and instead stores only the pointer.
>>> !                env[j] may be freed later, in which case the pointer
>>> !                in environ will now be left dangling into a deallocated
>>> !                region.
>>> !                So we make a copy of the variable.
>>> !             */
>>> !             char *s = strdup(env[j]);
>>> !
>>> !             if (NULL == s) {
>>> !                 return OPAL_ERR_OUT_OF_RESOURCE;
>>> !             }
>>> !             putenv(s);
>>>        }
>>>    }
>>>    /* All done */
>>> 
>>> *** orte/tools/orterun/orterun.c    2010-04-13 13:30:34.000000000 -0400
>>> --- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c    2011-05-09 20:28:16.588183000 -0400
>>> ***************
>>> *** 1578,1588 ****
>>>     }
>>> 
>>>     if (NULL != env) {
>>>         size1 = opal_argv_count(env);
>>>         for (j = 0; j < size1; ++j) {
>>> !             putenv(env[j]);
>>>         }
>>>     }
>>> 
>>>     /* All done */
>>> 
>>> --- 1578,1600 ----
>>>     }
>>> 
>>>     if (NULL != env) {
>>>         size1 = opal_argv_count(env);
>>>         for (j = 0; j < size1; ++j) {
>>> !             /* Use-after-Free error possible here.  putenv does not copy
>>> !                the string passed to it, and instead stores only the pointer.
>>> !                env[j] may be freed later, in which case the pointer
>>> !                in environ will now be left dangling into a deallocated
>>> !                region.
>>> !                So we make a copy of the variable.
>>> !             */
>>> !             char *s = strdup(env[j]);
>>> !
>>> !             if (NULL == s) {
>>> !                 return OPAL_ERR_OUT_OF_RESOURCE;
>>> !             }
>>> !             putenv(s);
>>>         }
>>>     }
>>> 
>>>     /* All done */
>>> 
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>    
> 

