On May 16, 2011, at 12:45 PM, Peter Thompson wrote:

> Hi Ralph,
>
> We've had a number of user complaints about this. Since it seems on the
> face of it that it is a debugger issue, it may not have made its way back
> here. Is your objection that the patch basically aborts if it gets a bad
> value? I could understand that being a concern. Of course, it aborts on
> TotalView now if we attempt to move forward without this patch.
>
No - my concern is that you appear to be removing the "putenv" calls. OMPI places some values into the local environment so the user can control behavior. Removing those causes problems.

What I need to know is why, after it has worked with TV for years, these putenv calls are suddenly a problem. Is the problem occurring during shutdown? Or is this something that causes TV to break?

> I've passed your comment back to the engineer, with a suspicion about the
> concerns about the abort, but if you have other objections, let me know.
>
> Cheers,
> PeterT
>
>
> Ralph Castain wrote:
>> That would be a problem, I fear. We need to push those envars into the
>> environment.
>>
>> Is there some particular problem causing what you see? We have no other
>> reports of this issue, and orterun has had that code forever.
>>
>> Sent from my iPad
>>
>> On May 11, 2011, at 2:05 PM, Peter Thompson <peter.thomp...@roguewave.com>
>> wrote:
>>
>>> We've gotten a few reports of problems with memory debugging when using
>>> OpenMPI under TotalView. Usually, TotalView will attach to the processes
>>> started after an MPI_Init. However, in the case where memory debugging is
>>> enabled, things seemed to run away or fail. My analysis showed that we
>>> had a number of core files left over from the attempt, and all were mpirun
>>> (or orterun) cores. It seemed to be a regression on our part, since
>>> testing seemed to indicate this worked okay before TotalView 8.9.0-0, so I
>>> filed an internal bug and passed it to engineering. After giving our
>>> engineer a brief tutorial on how to build a debug version of OpenMPI, he
>>> found what appears to be a problem in the code for orterun.c. He's made a
>>> slight change that fixes the issue in 1.4.2, 1.4.3, 1.4.4rc2 and 1.5.3,
>>> those being the versions he's tested with so far. He doesn't subscribe
>>> to this list that I know of, so I offered to pass this by the group.
>>> Of course, I'm not sure if this is exactly the right place to submit
>>> patches, but I'm sure you'd tell me where to put it if I'm in the wrong
>>> here. It's a short patch, so I'll cut and paste it, and attach as well,
>>> since cut and paste can do weird things to formatting.
>>>
>>> Credit goes to Ariel Burton for this patch. Of course he used TotalView to
>>> find this ;-) It shows up if you do 'mpirun -tv -np 4 ./foo' or
>>> 'totalview mpirun -a -np 4 ./foo'
>>>
>>> Cheers,
>>> PeterT
>>>
>>>
>>> more ~/patches/anbs-patch
>>> *** orte/tools/orterun/orterun.c 2010-04-13 13:30:34.000000000 -0400
>>> --- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c 2011-05-09 20:28:16.588183000 -0400
>>> ***************
>>> *** 1578,1588 ****
>>>       }
>>>
>>>       if (NULL != env) {
>>>           size1 = opal_argv_count(env);
>>>           for (j = 0; j < size1; ++j) {
>>> !             putenv(env[j]);
>>>           }
>>>       }
>>>
>>>       /* All done */
>>>
>>> --- 1578,1600 ----
>>>       }
>>>
>>>       if (NULL != env) {
>>>           size1 = opal_argv_count(env);
>>>           for (j = 0; j < size1; ++j) {
>>> !             /* Use-after-Free error possible here.  putenv does not copy
>>> !                the string passed to it, and instead stores only the pointer.
>>> !                env[j] may be freed later, in which case the pointer
>>> !                in environ will now be left dangling into a deallocated
>>> !                region.
>>> !                So we make a copy of the variable.
>>> !             */
>>> !             char *s = strdup(env[j]);
>>> !
>>> !             if (NULL == s) {
>>> !                 return OPAL_ERR_OUT_OF_RESOURCE;
>>> !             }
>>> !             putenv(s);
>>>           }
>>>       }
>>>
>>>       /* All done */
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
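
For anyone following the thread, the behavior described in the patch comment can be reproduced outside of Open MPI. The sketch below is not OMPI code; the function names and the DEMO_* variable names are made up for illustration. It assumes a POSIX system where putenv() stores the caller's pointer rather than copying the string, which is what POSIX specifies and what current glibc does. The first function shows that editing the buffer after putenv() changes what getenv() returns; the second shows that handing putenv() a strdup'd copy, as the patch does, leaves the environment intact after the original buffer is overwritten and freed.

```c
#define _XOPEN_SOURCE 700   /* ask for putenv() and strdup() declarations */
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the environment entry aliases the caller's buffer,
 * i.e. putenv() kept the pointer instead of copying the string.
 * Returns 0 if it did not, -1 on error. */
int putenv_aliases_buffer(void)
{
    /* static storage, so the pointer stays valid for the process lifetime */
    static char buf[] = "DEMO_PUTENV_VAR=first";

    if (putenv(buf) != 0) {
        return -1;
    }
    /* Overwrite the value part in place; "other" has the same length as "first". */
    memcpy(buf + strlen("DEMO_PUTENV_VAR="), "other", 5);

    const char *v = getenv("DEMO_PUTENV_VAR");
    return v != NULL && strcmp(v, "other") == 0;
}

/* Mirrors the approach in the patch: give putenv() a private copy, then
 * clobber and free the original. Returns 1 if the value survives. */
int copy_survives_free(void)
{
    char *original = strdup("DEMO_COPY_VAR=kept");
    if (original == NULL) {
        return -1;
    }
    char *copy = strdup(original);
    if (copy == NULL) {
        free(original);
        return -1;
    }
    if (putenv(copy) != 0) {
        free(copy);
        free(original);
        return -1;
    }
    /* 'copy' is deliberately never freed: the environment now refers to it. */
    memset(original, 'x', strlen(original));
    free(original);

    const char *v = getenv("DEMO_COPY_VAR");
    return v != NULL && strcmp(v, "kept") == 0;
}
```

Calling both functions from a small main() and checking that each returns 1 is enough to see the difference. The cost of the strdup approach is a small, intentional leak per exported variable, since the copy has to outlive the call for as long as the environment refers to it.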