Re: [OMPI devel] ORTE registry patch

2007-05-29 Thread Ralph Castain
After some work off-list with Tim, it appears that something has been broken
again on the OMPI trunk with respect to comm_spawn. It was working two weeks
ago, but...sigh.

Anyway, it doesn't appear to have any bearing either way on George's
patch(es), so whomever wants to commit them is welcome to do so.

Thanks
Ralph


On 5/29/07 11:44 AM, "Ralph Castain"  wrote:

> 
> 
> 
> On 5/29/07 11:02 AM, "Tim Prins"  wrote:
> 
>> Well, after fixing many of the tests...
> 
> Interesting - they worked fine for me. Perhaps a difference in environment.
> 
>> It passes all the tests
>> except the spawn tests. However, the spawn tests are seriously broken
>> without this patch as well, and the ibm mpi spawn tests seem to work
>> fine.
> 
> Then something is seriously wrong. The spawn tests were working as of my
> last commit - that is a test I religiously run. If the spawn test here
> doesn't work, then it is hard to understand how the mpi spawn can work since
> the call is identical.
> 
> Let me see what's wrong first...
> 
>> 
>> As far as I'm concerned, this should assuage any fear of problems
>> with these changes and they should now go in.
>> 
>> Tim
>> 
>> On May 29, 2007, at 11:34 AM, Ralph Castain wrote:
>> 
>>> Well, I'll be the voice of caution again...
>>> 
>>> Tim: did you run all of the orte tests in the orte/test/system
>>> directory? If
>>> so, and they all run correctly, then I have no issue with doing the
>>> commit.
>>> If not, then I would ask that we not do the commit until that has
>>> been done.
>>> 
>>> In running those tests, you need to run them on a multi-node
>>> system, both
>>> using mpirun and as singletons (you'll have to look at the tests to
>>> see
>>> which ones make sense in the latter case). This will ensure that we
>>> have at
>>> least some degree of coverage.
>>> 
>>> Thanks
>>> Ralph
>>> 
>>> 
>>> 
>>> On 5/29/07 9:23 AM, "George Bosilca"  wrote:
>>> 
 I'd be happy to commit the patch into the trunk. But after what
 happened last time, I'm more than cautious. If the community think
 the patch is worth having it, let me know and I'll push it in the
 trunk asap.
 
Thanks,
  george.
 
 On May 29, 2007, at 10:56 AM, Tim Prins wrote:
 
> I think both patches should be put in immediately. I have done some
> simple testing, and with 128 nodes of odin, with 1024 processes
> running mpi hello, these decrease our running time from about 14.2
> seconds to 10.9 seconds. This is a significant decrease, and as the
> scale increases there should be increasing benefit.
> 
> I'd be happy to commit these changes if no one objects.
> 
> Tim
> 
> On May 24, 2007, at 8:39 AM, Ralph H Castain wrote:
> 
>> Thanks - I'll take a look at this (and the prior ones!) in the next
>> couple
>> of weeks when time permits and get back to you.
>> 
>> Ralph
>> 
>> 
>> On 5/23/07 1:11 PM, "George Bosilca"  wrote:
>> 
>>> Attached is another patch to the ORTE layer, more specifically the
>>> replica. The idea is to decrease the number of strcmp by using a
>>> small hash function before doing the strcmp. The hask key for each
>>> registry entry is computed when it is added to the registry. When
>>> we're doing a query, instead of comparing the 2 strings we first
>>> check if the hash key match, and if they do match then we compare
>>> the
>>> 2 strings in order to make sure we eliminate collisions from our
>>> answers.
>>> 
>>> There is some benefit in terms of performance. It's hardly visible
>>> for few processes, but it start showing up when the number of
>>> processes increase. In fact the number of strcmp in the trace file
>>> drastically decrease. The main reason it works well, is because
>>> most
>>> of the keys start with basically the same chars (such as orte-
>>> blahblah) which transform the strcmp on a loop over few chars.
>>> 
>>> Ralph, please consider it for inclusion on the ORTE layer.
>>> 
>>>Thanks,
>>>  george.
>>> 
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
 
 ___
 devel mailing list
 de...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> 
>>> ___
>>> devel mailing list
>>> 

Re: [OMPI devel] [OMPI svn] svn:open-mpi r14789

2007-05-29 Thread Anya Tatashina

good :)

thank you,
Anya

Ralph Castain wrote:


My apologies - I reviewed the opal code and see that it wasn't what I
remembered. The objection here is that strncmp returns an int, not a
pointer, and hence the NULL is bothering the compiler.

I'll submit the fix

Thanks
Ralph



On 5/29/07 11:56 AM, "Ralph Castain"  wrote:

 


Hmmm...well, this operation

  if ( NULL == strncmp(appctx->app, OPAL_PATH_SEP, 1 )) {

is not a legal one since OPAL_PATH_SEP is a character and not a
NULL-terminated string and thus generates the following warning:

totalview.c: In function 'orte_totalview_init_after_spawn':
totalview.c:412: warning: comparison between pointer and integer

Could you modify it?

Thanks
Ralph



On 5/29/07 11:39 AM, "a...@osl.iu.edu"  wrote:

   


Author: Anya
Date: 2007-05-29 13:39:11 EDT (Tue, 29 May 2007)
New Revision: 14789
URL: https://svn.open-mpi.org/trac/ompi/changeset/14789

Log:
Ref Trac #1032; added suport for full path launching with TotalView
Text files modified:
  trunk/orte/tools/orterun/totalview.c |11 +--
  1 files changed, 9 insertions(+), 2 deletions(-)

Modified: trunk/orte/tools/orterun/totalview.c

 


=>>
=
 


--- trunk/orte/tools/orterun/totalview.c (original)
+++ trunk/orte/tools/orterun/totalview.c 2007-05-29 13:39:11 EDT (Tue, 29 May
2007)
@@ -10,6 +10,7 @@
 * University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 * All rights reserved.
+ * Copyright (c) 2007  Sun Microsystems, Inc. All rights reserved.
 * $COPYRIGHT$
 * 
 * Additional copyrights may follow

@@ -408,8 +409,14 @@
appctx = map->apps[proc->app_idx];
   
MPIR_proctable[i].host_name = strdup(node->nodename);

-MPIR_proctable[i].executable_name =
-opal_os_path( false, appctx->cwd, appctx->app, NULL );
+if ( NULL == strncmp(appctx->app, OPAL_PATH_SEP, 1 )) {
+   MPIR_proctable[i].executable_name =
+ opal_os_path( false, appctx->app, NULL );
+}
+else {
+   MPIR_proctable[i].executable_name =
+ opal_os_path( true, appctx->app, NULL );
+}
MPIR_proctable[i].pid = proc->pid;
i++;
}
___
svn mailing list
s...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn
 


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
   




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

 



Re: [OMPI devel] undefined environ symbol on Darwin

2007-05-29 Thread Brian Barrett

On May 29, 2007, at 7:35 AM, Jack Howarth wrote:


and if you see environ undefined, identify which library
it is in and which object file it came from. I would also
note that my patch reveals that several instances of the
environ variable being declared that are missing the Windows
wrappers. So if anything, adding the Darwin patch will
increase the probability that both targets are properly
maintained.


Yes, there are significant portions of the code base that are "Unix- 
only" and not built on Windows.  There are regular builds of Open MPI  
on Windows to ensure that problems are quickly resolved when they  
creep into the code base.  The places where the Windows environ fixes  
are missing are likely that way because they are in parts of the code  
that doesn't build on Windows.


As I've said, I'd be happy to commit a Mac OS X-specific fix for the  
environ problem if we can find a test case where it fails without the  
fix.  I'm not going to commit portability fixes to Open MPI for a  
problem that we can't replicate.  Based on what Peter said on the  
apple list, there is no problem with having an undefined symbol in a  
shared library (other than the fact that *that* shared library must  
be built with a flat namespace).  I'm working with someone here to  
get ParaView built on my Mac so I can trace down the problem and  
figure out if Open MPI is responsible for the missing symbol.


Brian


Re: [OMPI devel] [OMPI svn] svn:open-mpi r14789

2007-05-29 Thread Ralph Castain
My apologies - I reviewed the opal code and see that it wasn't what I
remembered. The objection here is that strncmp returns an int, not a
pointer, and hence the NULL is bothering the compiler.

I'll submit the fix

Thanks
Ralph



On 5/29/07 11:56 AM, "Ralph Castain"  wrote:

> Hmmm...well, this operation
> 
>if ( NULL == strncmp(appctx->app, OPAL_PATH_SEP, 1 )) {
> 
> is not a legal one since OPAL_PATH_SEP is a character and not a
> NULL-terminated string and thus generates the following warning:
> 
> totalview.c: In function 'orte_totalview_init_after_spawn':
> totalview.c:412: warning: comparison between pointer and integer
> 
> Could you modify it?
> 
> Thanks
> Ralph
> 
> 
> 
> On 5/29/07 11:39 AM, "a...@osl.iu.edu"  wrote:
> 
>> Author: Anya
>> Date: 2007-05-29 13:39:11 EDT (Tue, 29 May 2007)
>> New Revision: 14789
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/14789
>> 
>> Log:
>> Ref Trac #1032; added suport for full path launching with TotalView
>> Text files modified:
>>trunk/orte/tools/orterun/totalview.c |11 +--
>>1 files changed, 9 insertions(+), 2 deletions(-)
>> 
>> Modified: trunk/orte/tools/orterun/totalview.c
>> 
=>>
=
>> --- trunk/orte/tools/orterun/totalview.c (original)
>> +++ trunk/orte/tools/orterun/totalview.c 2007-05-29 13:39:11 EDT (Tue, 29 May
>> 2007)
>> @@ -10,6 +10,7 @@
>>   * University of Stuttgart.  All rights reserved.
>>   * Copyright (c) 2004-2005 The Regents of the University of California.
>>   * All rights reserved.
>> + * Copyright (c) 2007  Sun Microsystems, Inc. All rights reserved.
>>   * $COPYRIGHT$
>>   * 
>>   * Additional copyrights may follow
>> @@ -408,8 +409,14 @@
>>  appctx = map->apps[proc->app_idx];
>> 
>>  MPIR_proctable[i].host_name = strdup(node->nodename);
>> -MPIR_proctable[i].executable_name =
>> -opal_os_path( false, appctx->cwd, appctx->app, NULL );
>> +if ( NULL == strncmp(appctx->app, OPAL_PATH_SEP, 1 )) {
>> +   MPIR_proctable[i].executable_name =
>> + opal_os_path( false, appctx->app, NULL );
>> +}
>> +else {
>> +   MPIR_proctable[i].executable_name =
>> + opal_os_path( true, appctx->app, NULL );
>> +}
>>  MPIR_proctable[i].pid = proc->pid;
>>  i++;
>>  }
>> ___
>> svn mailing list
>> s...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] ORTE registry patch

2007-05-29 Thread Tim Prins
Well, after fixing many of the tests... It passes all the tests  
except the spawn tests. However, the spawn tests are seriously broken  
without this patch as well, and the ibm mpi spawn tests seem to work  
fine.


As far as I'm concerned, this should assuage any fear of problems  
with these changes and they should now go in.


Tim

On May 29, 2007, at 11:34 AM, Ralph Castain wrote:


Well, I'll be the voice of caution again...

Tim: did you run all of the orte tests in the orte/test/system  
directory? If
so, and they all run correctly, then I have no issue with doing the  
commit.
If not, then I would ask that we not do the commit until that has  
been done.


In running those tests, you need to run them on a multi-node  
system, both
using mpirun and as singletons (you'll have to look at the tests to  
see
which ones make sense in the latter case). This will ensure that we  
have at

least some degree of coverage.

Thanks
Ralph



On 5/29/07 9:23 AM, "George Bosilca"  wrote:


I'd be happy to commit the patch into the trunk. But after what
happened last time, I'm more than cautious. If the community think
the patch is worth having it, let me know and I'll push it in the
trunk asap.

   Thanks,
 george.

On May 29, 2007, at 10:56 AM, Tim Prins wrote:


I think both patches should be put in immediately. I have done some
simple testing, and with 128 nodes of odin, with 1024 processes
running mpi hello, these decrease our running time from about 14.2
seconds to 10.9 seconds. This is a significant decrease, and as the
scale increases there should be increasing benefit.

I'd be happy to commit these changes if no one objects.

Tim

On May 24, 2007, at 8:39 AM, Ralph H Castain wrote:


Thanks - I'll take a look at this (and the prior ones!) in the next
couple
of weeks when time permits and get back to you.

Ralph


On 5/23/07 1:11 PM, "George Bosilca"  wrote:


Attached is another patch to the ORTE layer, more specifically the
replica. The idea is to decrease the number of strcmp by using a
small hash function before doing the strcmp. The hask key for each
registry entry is computed when it is added to the registry. When
we're doing a query, instead of comparing the 2 strings we first
check if the hash key match, and if they do match then we compare
the
2 strings in order to make sure we eliminate collisions from our
answers.

There is some benefit in terms of performance. It's hardly visible
for few processes, but it start showing up when the number of
processes increase. In fact the number of strcmp in the trace file
drastically decrease. The main reason it works well, is because  
most

of the keys start with basically the same chars (such as orte-
blahblah) which transform the strcmp on a loop over few chars.

Ralph, please consider it for inclusion on the ORTE layer.

   Thanks,
 george.


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] ORTE registry patch

2007-05-29 Thread Ralph Castain
Well, I'll be the voice of caution again...

Tim: did you run all of the orte tests in the orte/test/system directory? If
so, and they all run correctly, then I have no issue with doing the commit.
If not, then I would ask that we not do the commit until that has been done.

In running those tests, you need to run them on a multi-node system, both
using mpirun and as singletons (you'll have to look at the tests to see
which ones make sense in the latter case). This will ensure that we have at
least some degree of coverage.

Thanks
Ralph



On 5/29/07 9:23 AM, "George Bosilca"  wrote:

> I'd be happy to commit the patch into the trunk. But after what
> happened last time, I'm more than cautious. If the community think
> the patch is worth having it, let me know and I'll push it in the
> trunk asap.
> 
>Thanks,
>  george.
> 
> On May 29, 2007, at 10:56 AM, Tim Prins wrote:
> 
>> I think both patches should be put in immediately. I have done some
>> simple testing, and with 128 nodes of odin, with 1024 processes
>> running mpi hello, these decrease our running time from about 14.2
>> seconds to 10.9 seconds. This is a significant decrease, and as the
>> scale increases there should be increasing benefit.
>> 
>> I'd be happy to commit these changes if no one objects.
>> 
>> Tim
>> 
>> On May 24, 2007, at 8:39 AM, Ralph H Castain wrote:
>> 
>>> Thanks - I'll take a look at this (and the prior ones!) in the next
>>> couple
>>> of weeks when time permits and get back to you.
>>> 
>>> Ralph
>>> 
>>> 
>>> On 5/23/07 1:11 PM, "George Bosilca"  wrote:
>>> 
 Attached is another patch to the ORTE layer, more specifically the
 replica. The idea is to decrease the number of strcmp by using a
 small hash function before doing the strcmp. The hask key for each
 registry entry is computed when it is added to the registry. When
 we're doing a query, instead of comparing the 2 strings we first
 check if the hash key match, and if they do match then we compare
 the
 2 strings in order to make sure we eliminate collisions from our
 answers.
 
 There is some benefit in terms of performance. It's hardly visible
 for few processes, but it start showing up when the number of
 processes increase. In fact the number of strcmp in the trace file
 drastically decrease. The main reason it works well, is because most
 of the keys start with basically the same chars (such as orte-
 blahblah) which transform the strcmp on a loop over few chars.
 
 Ralph, please consider it for inclusion on the ORTE layer.
 
Thanks,
  george.
 
 
 ___
 devel mailing list
 de...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] ORTE registry patch

2007-05-29 Thread Tim Prins
I think both patches should be put in immediately. I have done some  
simple testing, and with 128 nodes of odin, with 1024 processes  
running mpi hello, these decrease our running time from about 14.2  
seconds to 10.9 seconds. This is a significant decrease, and as the  
scale increases there should be increasing benefit.


I'd be happy to commit these changes if no one objects.

Tim

On May 24, 2007, at 8:39 AM, Ralph H Castain wrote:

Thanks - I'll take a look at this (and the prior ones!) in the next  
couple

of weeks when time permits and get back to you.

Ralph


On 5/23/07 1:11 PM, "George Bosilca"  wrote:


Attached is another patch to the ORTE layer, more specifically the
replica. The idea is to decrease the number of strcmp by using a
small hash function before doing the strcmp. The hask key for each
registry entry is computed when it is added to the registry. When
we're doing a query, instead of comparing the 2 strings we first
check if the hash key match, and if they do match then we compare the
2 strings in order to make sure we eliminate collisions from our
answers.

There is some benefit in terms of performance. It's hardly visible
for few processes, but it start showing up when the number of
processes increase. In fact the number of strcmp in the trace file
drastically decrease. The main reason it works well, is because most
of the keys start with basically the same chars (such as orte-
blahblah) which transform the strcmp on a loop over few chars.

Ralph, please consider it for inclusion on the ORTE layer.

   Thanks,
 george.


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] undefined environ symbol on Darwin

2007-05-29 Thread Jack Howarth
Ralph,
I don't know why such patches should be harder to
maintain in Windows than on Darwin. All you do is
the equivalent to...

cd /sw/lib/openmpi/lib
nm * | grep environ

and if you see environ undefined, identify which library
it is in and which object file it came from. I would also
note that my patch reveals that several instances of the
environ variable being declared that are missing the Windows
wrappers. So if anything, adding the Darwin patch will
increase the probability that both targets are properly
maintained.
 Jack


Re: [OMPI devel] undefined environ symbol on Darwin

2007-05-29 Thread Jack Howarth
Brian,
Below is the response I got from the fink developer
who was trying to package paraview and ran into this
problem with the undefined environ symbol in openmpi.
I have also emailed on darwin-dev to get a clarification
on this issue. However your argument against target
specific chances is rather ironic considering openmpi
already has wrappers for Windows on those lines of code.
I've attached the error that is seen while compiling paraview. 
 Jack

---

I tried several patches to the paraview build scripts in order to get
rid of this undefined symbol, but I was not successful, one reason being
that the build process uses cmake which I find much less transparent
than the usual configure/libtool combination.

While looking through the build script, I see now that in fact there is
a difference between the linker command for this library
libicet_mpi.dylib and many other dylibs built in paraview, for example
libvtkParallel.pv2.6.dylib. Those use an explicit
-Wl,-flat_namespace,-U,_environ linker flag. Could it be that this is
there in order to avoid exactly this problem? I don't know. I'll look
some more where this comes from.

Here the error:

cd
/sw_unstable/src/fink.build/paraview-mpi-openmpi-2.6.1-1001/paraview-mpi-openmpi-darwin/Utilities/IceT/src/communication
&& /sw/bin/cmake -E cmake_link_script CMakeFiles/icet_mpi.dir/link.txt
--verbose=1
/sw/bin/gcc-O3 -DNDEBUG -dynamiclib -headerpad_max_install_names
-Wl,-search_paths_first -o ../../../../bin/libicet_mpi.dylib
-install_name /sw/lib/paraview-2.6/libicet_mpi.dylib
"CMakeFiles/icet_mpi.dir/mpi.o"
-L/sw_unstable/src/fink.build/paraview-mpi-openmpi-2.6.1-1001/paraview-mpi-openmpi-darwin/bin
-L/usr/X11R6/lib -lGLU -L/usr/X11R6/lib -lGL -L/sw/lib/openmpi -lmpi
-lmpi_cxx -licet -L/usr/X11R6/lib -lGLU -L/usr/X11R6/lib -lGL
-L/sw/lib/openmpi -lmpi -lmpi_cxx
ld: Undefined symbols:
_environ
/usr/bin/libtool: internal link edit command failed
make[2]: *** [bin/libicet_mpi.dylib] Error 1




Re: [OMPI devel] undefined environ symbol on Darwin

2007-05-29 Thread Ralph Castain
I am no expert on this issue. However, having watched those attempting to
maintain the Windows port using the same approach you are advocating, I can
say that requiring APPLE-specific "if...include" logic almost certainly will
be an exercise in frustration, if not futility. Many of us actually develop
Open MPI code on the Mac and have never seen this problem - which indicates
(to me, at least) that we will have great trouble "policing" this
requirement even within that sub-group of the project.

And as for the rest of the developers who work on Linux and other systems -
asking them to "please remember that Darwin requires some strange mojo" is
something we can do, but (based on experience with Windows) is very unlikely
to happen.

IF a behind-the-scenes solution can be found, that would be far more
preferable. Otherwise, we will likely find us patching releases regularly as
someone else discovers a missing "if APPLE" somewhere in the code.

Alternatively, of course, Darwin could just conform to the rest of the Unix
world... ;-)  (just kidding, of course...)

Ralph





On 5/28/07 8:19 PM, "Jack Howarth"  wrote:

> Brian,
> One general example of why openmpi shouldn't be creating
> shared libraries with undefined environ symbols is as follows.
> If a python based application was using the openmpi shared libraries
> linked into the application's python module, your suggested
> approach would be unusable since the user would have to rebuild
> python to explicitly provide the missing environ symbol. You
> will always run into such corner cases as long as openmpi is
> misbuilt on darwin.
>  Jack
> ps I have posted this issue to Apple's radar bug reporter
> as issue 5233061 as well.
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel