Re: [OMPI devel] ORTE registry patch
After some work off-list with Tim, it appears that something has been broken again on the OMPI trunk with respect to comm_spawn. It was working two weeks ago, but...sigh. Anyway, it doesn't appear to have any bearing either way on George's patch(es), so whomever wants to commit them is welcome to do so. Thanks Ralph On 5/29/07 11:44 AM, "Ralph Castain"wrote: > > > > On 5/29/07 11:02 AM, "Tim Prins" wrote: > >> Well, after fixing many of the tests... > > Interesting - they worked fine for me. Perhaps a difference in environment. > >> It passes all the tests >> except the spawn tests. However, the spawn tests are seriously broken >> without this patch as well, and the ibm mpi spawn tests seem to work >> fine. > > Then something is seriously wrong. The spawn tests were working as of my > last commit - that is a test I religiously run. If the spawn test here > doesn't work, then it is hard to understand how the mpi spawn can work since > the call is identical. > > Let me see what's wrong first... > >> >> As far as I'm concerned, this should assuage any fear of problems >> with these changes and they should now go in. >> >> Tim >> >> On May 29, 2007, at 11:34 AM, Ralph Castain wrote: >> >>> Well, I'll be the voice of caution again... >>> >>> Tim: did you run all of the orte tests in the orte/test/system >>> directory? If >>> so, and they all run correctly, then I have no issue with doing the >>> commit. >>> If not, then I would ask that we not do the commit until that has >>> been done. >>> >>> In running those tests, you need to run them on a multi-node >>> system, both >>> using mpirun and as singletons (you'll have to look at the tests to >>> see >>> which ones make sense in the latter case). This will ensure that we >>> have at >>> least some degree of coverage. >>> >>> Thanks >>> Ralph >>> >>> >>> >>> On 5/29/07 9:23 AM, "George Bosilca" wrote: >>> I'd be happy to commit the patch into the trunk. But after what happened last time, I'm more than cautious. If the community think the patch is worth having it, let me know and I'll push it in the trunk asap. Thanks, george. On May 29, 2007, at 10:56 AM, Tim Prins wrote: > I think both patches should be put in immediately. I have done some > simple testing, and with 128 nodes of odin, with 1024 processes > running mpi hello, these decrease our running time from about 14.2 > seconds to 10.9 seconds. This is a significant decrease, and as the > scale increases there should be increasing benefit. > > I'd be happy to commit these changes if no one objects. > > Tim > > On May 24, 2007, at 8:39 AM, Ralph H Castain wrote: > >> Thanks - I'll take a look at this (and the prior ones!) in the next >> couple >> of weeks when time permits and get back to you. >> >> Ralph >> >> >> On 5/23/07 1:11 PM, "George Bosilca" wrote: >> >>> Attached is another patch to the ORTE layer, more specifically the >>> replica. The idea is to decrease the number of strcmp by using a >>> small hash function before doing the strcmp. The hask key for each >>> registry entry is computed when it is added to the registry. When >>> we're doing a query, instead of comparing the 2 strings we first >>> check if the hash key match, and if they do match then we compare >>> the >>> 2 strings in order to make sure we eliminate collisions from our >>> answers. >>> >>> There is some benefit in terms of performance. It's hardly visible >>> for few processes, but it start showing up when the number of >>> processes increase. In fact the number of strcmp in the trace file >>> drastically decrease. The main reason it works well, is because >>> most >>> of the keys start with basically the same chars (such as orte- >>> blahblah) which transform the strcmp on a loop over few chars. >>> >>> Ralph, please consider it for inclusion on the ORTE layer. >>> >>>Thanks, >>> george. >>> >>> >>> ___ >>> devel mailing list >>> de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> >> >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> >>> >>> ___ >>> devel mailing list >>>
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14789
good :) thank you, Anya Ralph Castain wrote: My apologies - I reviewed the opal code and see that it wasn't what I remembered. The objection here is that strncmp returns an int, not a pointer, and hence the NULL is bothering the compiler. I'll submit the fix Thanks Ralph On 5/29/07 11:56 AM, "Ralph Castain"wrote: Hmmm...well, this operation if ( NULL == strncmp(appctx->app, OPAL_PATH_SEP, 1 )) { is not a legal one since OPAL_PATH_SEP is a character and not a NULL-terminated string and thus generates the following warning: totalview.c: In function 'orte_totalview_init_after_spawn': totalview.c:412: warning: comparison between pointer and integer Could you modify it? Thanks Ralph On 5/29/07 11:39 AM, "a...@osl.iu.edu" wrote: Author: Anya Date: 2007-05-29 13:39:11 EDT (Tue, 29 May 2007) New Revision: 14789 URL: https://svn.open-mpi.org/trac/ompi/changeset/14789 Log: Ref Trac #1032; added suport for full path launching with TotalView Text files modified: trunk/orte/tools/orterun/totalview.c |11 +-- 1 files changed, 9 insertions(+), 2 deletions(-) Modified: trunk/orte/tools/orterun/totalview.c =>> = --- trunk/orte/tools/orterun/totalview.c (original) +++ trunk/orte/tools/orterun/totalview.c 2007-05-29 13:39:11 EDT (Tue, 29 May 2007) @@ -10,6 +10,7 @@ * University of Stuttgart. All rights reserved. * Copyright (c) 2004-2005 The Regents of the University of California. * All rights reserved. + * Copyright (c) 2007 Sun Microsystems, Inc. All rights reserved. * $COPYRIGHT$ * * Additional copyrights may follow @@ -408,8 +409,14 @@ appctx = map->apps[proc->app_idx]; MPIR_proctable[i].host_name = strdup(node->nodename); -MPIR_proctable[i].executable_name = -opal_os_path( false, appctx->cwd, appctx->app, NULL ); +if ( NULL == strncmp(appctx->app, OPAL_PATH_SEP, 1 )) { + MPIR_proctable[i].executable_name = + opal_os_path( false, appctx->app, NULL ); +} +else { + MPIR_proctable[i].executable_name = + opal_os_path( true, appctx->app, NULL ); +} MPIR_proctable[i].pid = proc->pid; i++; } ___ svn mailing list s...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/svn ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] undefined environ symbol on Darwin
On May 29, 2007, at 7:35 AM, Jack Howarth wrote: and if you see environ undefined, identify which library it is in and which object file it came from. I would also note that my patch reveals that several instances of the environ variable being declared that are missing the Windows wrappers. So if anything, adding the Darwin patch will increase the probability that both targets are properly maintained. Yes, there are significant portions of the code base that are "Unix- only" and not built on Windows. There are regular builds of Open MPI on Windows to ensure that problems are quickly resolved when they creep into the code base. The places where the Windows environ fixes are missing are likely that way because they are in parts of the code that doesn't build on Windows. As I've said, I'd be happy to commit a Mac OS X-specific fix for the environ problem if we can find a test case where it fails without the fix. I'm not going to commit portability fixes to Open MPI for a problem that we can't replicate. Based on what Peter said on the apple list, there is no problem with having an undefined symbol in a shared library (other than the fact that *that* shared library must be built with a flat namespace). I'm working with someone here to get ParaView built on my Mac so I can trace down the problem and figure out if Open MPI is responsible for the missing symbol. Brian
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14789
My apologies - I reviewed the opal code and see that it wasn't what I remembered. The objection here is that strncmp returns an int, not a pointer, and hence the NULL is bothering the compiler. I'll submit the fix Thanks Ralph On 5/29/07 11:56 AM, "Ralph Castain"wrote: > Hmmm...well, this operation > >if ( NULL == strncmp(appctx->app, OPAL_PATH_SEP, 1 )) { > > is not a legal one since OPAL_PATH_SEP is a character and not a > NULL-terminated string and thus generates the following warning: > > totalview.c: In function 'orte_totalview_init_after_spawn': > totalview.c:412: warning: comparison between pointer and integer > > Could you modify it? > > Thanks > Ralph > > > > On 5/29/07 11:39 AM, "a...@osl.iu.edu" wrote: > >> Author: Anya >> Date: 2007-05-29 13:39:11 EDT (Tue, 29 May 2007) >> New Revision: 14789 >> URL: https://svn.open-mpi.org/trac/ompi/changeset/14789 >> >> Log: >> Ref Trac #1032; added suport for full path launching with TotalView >> Text files modified: >>trunk/orte/tools/orterun/totalview.c |11 +-- >>1 files changed, 9 insertions(+), 2 deletions(-) >> >> Modified: trunk/orte/tools/orterun/totalview.c >> =>> = >> --- trunk/orte/tools/orterun/totalview.c (original) >> +++ trunk/orte/tools/orterun/totalview.c 2007-05-29 13:39:11 EDT (Tue, 29 May >> 2007) >> @@ -10,6 +10,7 @@ >> * University of Stuttgart. All rights reserved. >> * Copyright (c) 2004-2005 The Regents of the University of California. >> * All rights reserved. >> + * Copyright (c) 2007 Sun Microsystems, Inc. All rights reserved. >> * $COPYRIGHT$ >> * >> * Additional copyrights may follow >> @@ -408,8 +409,14 @@ >> appctx = map->apps[proc->app_idx]; >> >> MPIR_proctable[i].host_name = strdup(node->nodename); >> -MPIR_proctable[i].executable_name = >> -opal_os_path( false, appctx->cwd, appctx->app, NULL ); >> +if ( NULL == strncmp(appctx->app, OPAL_PATH_SEP, 1 )) { >> + MPIR_proctable[i].executable_name = >> + opal_os_path( false, appctx->app, NULL ); >> +} >> +else { >> + MPIR_proctable[i].executable_name = >> + opal_os_path( true, appctx->app, NULL ); >> +} >> MPIR_proctable[i].pid = proc->pid; >> i++; >> } >> ___ >> svn mailing list >> s...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/svn > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] ORTE registry patch
Well, after fixing many of the tests... It passes all the tests except the spawn tests. However, the spawn tests are seriously broken without this patch as well, and the ibm mpi spawn tests seem to work fine. As far as I'm concerned, this should assuage any fear of problems with these changes and they should now go in. Tim On May 29, 2007, at 11:34 AM, Ralph Castain wrote: Well, I'll be the voice of caution again... Tim: did you run all of the orte tests in the orte/test/system directory? If so, and they all run correctly, then I have no issue with doing the commit. If not, then I would ask that we not do the commit until that has been done. In running those tests, you need to run them on a multi-node system, both using mpirun and as singletons (you'll have to look at the tests to see which ones make sense in the latter case). This will ensure that we have at least some degree of coverage. Thanks Ralph On 5/29/07 9:23 AM, "George Bosilca"wrote: I'd be happy to commit the patch into the trunk. But after what happened last time, I'm more than cautious. If the community think the patch is worth having it, let me know and I'll push it in the trunk asap. Thanks, george. On May 29, 2007, at 10:56 AM, Tim Prins wrote: I think both patches should be put in immediately. I have done some simple testing, and with 128 nodes of odin, with 1024 processes running mpi hello, these decrease our running time from about 14.2 seconds to 10.9 seconds. This is a significant decrease, and as the scale increases there should be increasing benefit. I'd be happy to commit these changes if no one objects. Tim On May 24, 2007, at 8:39 AM, Ralph H Castain wrote: Thanks - I'll take a look at this (and the prior ones!) in the next couple of weeks when time permits and get back to you. Ralph On 5/23/07 1:11 PM, "George Bosilca" wrote: Attached is another patch to the ORTE layer, more specifically the replica. The idea is to decrease the number of strcmp by using a small hash function before doing the strcmp. The hask key for each registry entry is computed when it is added to the registry. When we're doing a query, instead of comparing the 2 strings we first check if the hash key match, and if they do match then we compare the 2 strings in order to make sure we eliminate collisions from our answers. There is some benefit in terms of performance. It's hardly visible for few processes, but it start showing up when the number of processes increase. In fact the number of strcmp in the trace file drastically decrease. The main reason it works well, is because most of the keys start with basically the same chars (such as orte- blahblah) which transform the strcmp on a loop over few chars. Ralph, please consider it for inclusion on the ORTE layer. Thanks, george. ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] ORTE registry patch
Well, I'll be the voice of caution again... Tim: did you run all of the orte tests in the orte/test/system directory? If so, and they all run correctly, then I have no issue with doing the commit. If not, then I would ask that we not do the commit until that has been done. In running those tests, you need to run them on a multi-node system, both using mpirun and as singletons (you'll have to look at the tests to see which ones make sense in the latter case). This will ensure that we have at least some degree of coverage. Thanks Ralph On 5/29/07 9:23 AM, "George Bosilca"wrote: > I'd be happy to commit the patch into the trunk. But after what > happened last time, I'm more than cautious. If the community think > the patch is worth having it, let me know and I'll push it in the > trunk asap. > >Thanks, > george. > > On May 29, 2007, at 10:56 AM, Tim Prins wrote: > >> I think both patches should be put in immediately. I have done some >> simple testing, and with 128 nodes of odin, with 1024 processes >> running mpi hello, these decrease our running time from about 14.2 >> seconds to 10.9 seconds. This is a significant decrease, and as the >> scale increases there should be increasing benefit. >> >> I'd be happy to commit these changes if no one objects. >> >> Tim >> >> On May 24, 2007, at 8:39 AM, Ralph H Castain wrote: >> >>> Thanks - I'll take a look at this (and the prior ones!) in the next >>> couple >>> of weeks when time permits and get back to you. >>> >>> Ralph >>> >>> >>> On 5/23/07 1:11 PM, "George Bosilca" wrote: >>> Attached is another patch to the ORTE layer, more specifically the replica. The idea is to decrease the number of strcmp by using a small hash function before doing the strcmp. The hask key for each registry entry is computed when it is added to the registry. When we're doing a query, instead of comparing the 2 strings we first check if the hash key match, and if they do match then we compare the 2 strings in order to make sure we eliminate collisions from our answers. There is some benefit in terms of performance. It's hardly visible for few processes, but it start showing up when the number of processes increase. In fact the number of strcmp in the trace file drastically decrease. The main reason it works well, is because most of the keys start with basically the same chars (such as orte- blahblah) which transform the strcmp on a loop over few chars. Ralph, please consider it for inclusion on the ORTE layer. Thanks, george. ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> >>> >>> ___ >>> devel mailing list >>> de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] ORTE registry patch
I think both patches should be put in immediately. I have done some simple testing, and with 128 nodes of odin, with 1024 processes running mpi hello, these decrease our running time from about 14.2 seconds to 10.9 seconds. This is a significant decrease, and as the scale increases there should be increasing benefit. I'd be happy to commit these changes if no one objects. Tim On May 24, 2007, at 8:39 AM, Ralph H Castain wrote: Thanks - I'll take a look at this (and the prior ones!) in the next couple of weeks when time permits and get back to you. Ralph On 5/23/07 1:11 PM, "George Bosilca"wrote: Attached is another patch to the ORTE layer, more specifically the replica. The idea is to decrease the number of strcmp by using a small hash function before doing the strcmp. The hask key for each registry entry is computed when it is added to the registry. When we're doing a query, instead of comparing the 2 strings we first check if the hash key match, and if they do match then we compare the 2 strings in order to make sure we eliminate collisions from our answers. There is some benefit in terms of performance. It's hardly visible for few processes, but it start showing up when the number of processes increase. In fact the number of strcmp in the trace file drastically decrease. The main reason it works well, is because most of the keys start with basically the same chars (such as orte- blahblah) which transform the strcmp on a loop over few chars. Ralph, please consider it for inclusion on the ORTE layer. Thanks, george. ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] undefined environ symbol on Darwin
Ralph, I don't know why such patches should be harder to maintain in Windows than on Darwin. All you do is the equivalent to... cd /sw/lib/openmpi/lib nm * | grep environ and if you see environ undefined, identify which library it is in and which object file it came from. I would also note that my patch reveals that several instances of the environ variable being declared that are missing the Windows wrappers. So if anything, adding the Darwin patch will increase the probability that both targets are properly maintained. Jack
Re: [OMPI devel] undefined environ symbol on Darwin
Brian, Below is the response I got from the fink developer who was trying to package paraview and ran into this problem with the undefined environ symbol in openmpi. I have also emailed on darwin-dev to get a clarification on this issue. However your argument against target specific chances is rather ironic considering openmpi already has wrappers for Windows on those lines of code. I've attached the error that is seen while compiling paraview. Jack --- I tried several patches to the paraview build scripts in order to get rid of this undefined symbol, but I was not successful, one reason being that the build process uses cmake which I find much less transparent than the usual configure/libtool combination. While looking through the build script, I see now that in fact there is a difference between the linker command for this library libicet_mpi.dylib and many other dylibs built in paraview, for example libvtkParallel.pv2.6.dylib. Those use an explicit -Wl,-flat_namespace,-U,_environ linker flag. Could it be that this is there in order to avoid exactly this problem? I don't know. I'll look some more where this comes from. Here the error: cd /sw_unstable/src/fink.build/paraview-mpi-openmpi-2.6.1-1001/paraview-mpi-openmpi-darwin/Utilities/IceT/src/communication && /sw/bin/cmake -E cmake_link_script CMakeFiles/icet_mpi.dir/link.txt --verbose=1 /sw/bin/gcc-O3 -DNDEBUG -dynamiclib -headerpad_max_install_names -Wl,-search_paths_first -o ../../../../bin/libicet_mpi.dylib -install_name /sw/lib/paraview-2.6/libicet_mpi.dylib "CMakeFiles/icet_mpi.dir/mpi.o" -L/sw_unstable/src/fink.build/paraview-mpi-openmpi-2.6.1-1001/paraview-mpi-openmpi-darwin/bin -L/usr/X11R6/lib -lGLU -L/usr/X11R6/lib -lGL -L/sw/lib/openmpi -lmpi -lmpi_cxx -licet -L/usr/X11R6/lib -lGLU -L/usr/X11R6/lib -lGL -L/sw/lib/openmpi -lmpi -lmpi_cxx ld: Undefined symbols: _environ /usr/bin/libtool: internal link edit command failed make[2]: *** [bin/libicet_mpi.dylib] Error 1
Re: [OMPI devel] undefined environ symbol on Darwin
I am no expert on this issue. However, having watched those attempting to maintain the Windows port using the same approach you are advocating, I can say that requiring APPLE-specific "if...include" logic almost certainly will be an exercise in frustration, if not futility. Many of us actually develop Open MPI code on the Mac and have never seen this problem - which indicates (to me, at least) that we will have great trouble "policing" this requirement even within that sub-group of the project. And as for the rest of the developers who work on Linux and other systems - asking them to "please remember that Darwin requires some strange mojo" is something we can do, but (based on experience with Windows) is very unlikely to happen. IF a behind-the-scenes solution can be found, that would be far more preferable. Otherwise, we will likely find us patching releases regularly as someone else discovers a missing "if APPLE" somewhere in the code. Alternatively, of course, Darwin could just conform to the rest of the Unix world... ;-) (just kidding, of course...) Ralph On 5/28/07 8:19 PM, "Jack Howarth"wrote: > Brian, > One general example of why openmpi shouldn't be creating > shared libraries with undefined environ symbols is as follows. > If a python based application was using the openmpi shared libraries > linked into the application's python module, your suggested > approach would be unusable since the user would have to rebuild > python to explicitly provide the missing environ symbol. You > will always run into such corner cases as long as openmpi is > misbuilt on darwin. > Jack > ps I have posted this issue to Apple's radar bug reporter > as issue 5233061 as well. > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel