Re: [OMPI devel] Integrating the memchecker branch

2008-01-14 Thread Jeff Squyres
I have not had a chance to check out the tmp branch for this (I'm  
currently in an airport without network access), but it all sounds  
good in principle to me.  Forgive me if I've said these things before,  
but here's what I'd like to see if possible:


- configure output shows whether this stuff is enabled
  - e.g., does it check for the relevant macros in valgrind's header  
files?  (I assume so; I've totally forgotten...)  Ensure that these  
checks are output in configure's stdout


- ompi_info shows whether this stuff is enabled

- obvious user-level configure errors raise errors/abort configure
(e.g., --enable-memchecker is specified but --enable-debug is not), or
make some obvious assumptions about "what the user meant" (e.g., if
--enable-memchecker is specified but --enable-debug is not, then
automatically enable --enable-debug and output a message saying so).


- I think we've said ad nauseam that there should be zero performance  
penalty for when this stuff is not enabled, and I'm guessing that this  
is still true.  :-)


- some kind of documentation should be written up about how to use
this stuff, perhaps in the FAQ (e.g., pairing it with a
valgrind-enabled libibverbs for max benefit, etc.).
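
The configure-time policy from the list above (either abort on --enable-memchecker without --enable-debug, or enable debug automatically and say so) could look roughly like this. It is only a sketch of the decision logic in Python, not the m4 that configure would actually use; the name resolve_options and the strict switch are hypothetical.

```python
def resolve_options(enable_memchecker, enable_debug, strict=False):
    """Apply the memchecker/debug policy and return the final settings.

    Hypothetical helper, for illustration only.
    strict=True  -> the inconsistent combination is a configure error.
    strict=False -> assume the user meant to enable debug, and say so.
    Returns (memchecker, debug, messages).
    """
    messages = []
    if enable_memchecker and not enable_debug:
        if strict:
            raise ValueError("--enable-memchecker requires --enable-debug")
        # "What the user meant": turn debug on and report it.
        enable_debug = True
        messages.append(
            "--enable-memchecker given without --enable-debug; "
            "enabling --enable-debug automatically"
        )
    return enable_memchecker, enable_debug, messages
```

Either branch makes the outcome visible to the user, which is the point of the request: no silent half-enabled build.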




If --enable-memchecker is on, --enable-debug should be on as well to make
sense



On Jan 8, 2008, at 3:11 PM, Rainer Keller wrote:


Hello dear all,

WHAT:
We would like to integrate the changes on the memchecker-branch to trunk, as
planned in the

WHY:
The memchecker branch offers memory checking for certain user and
OMPI-internal errors, like buffer overruns, size mismatches, and wrong
send/receive buffers.


WHERE: OMPI trunk and v1.3 phase3

WHEN:
Integration into Trunk of memchecker branch: 25.1. (although off-by-default,
this leaves enough time before Feature Freeze on 8.2.)

TIMEOUT: None
===

The memchecker branch contains checks for memory buffer faults either in the
user code or in the ompi code itself.
It uses the valgrind API to set/reset the validity of the user buffers
passed to the MPI layer. Additionally, ompi-internal datatypes are checked.
Both are configurable using the flags:
  --enable-memchecker
  --with-valgrind=DIR (if needed)

A decent/recent valgrind is needed (for getting and setting VBITS/using the
newer macros). The valgrind version is checked; at least version 3.2.0 is
required.

The actual checking is done in the MPI layer. In order not to trap any
(correct) access in the BTL, the user buffer is reset to accessible in the
PML layer (currently OB1 -- others won't make much sense?).

The default behaviour is to *NOT* enable memchecker.
If it is enabled but the application is not run under valgrind, the cost of
the buffer checks is minimal; the cost for each ompi data structure (like a
datatype or communicator passed) is not.
Further information regarding penalties and performance may be found in:

http://www.open-mpi.org/papers/parco-2007

Comments from the Paris meeting have been integrated.
Are there any objections or hints?

With best regards,
Shiqing and Rainer

PS: If --enable-memchecker is on, --enable-debug should be on as well to make
sense.
--

Dipl.-Inf. Rainer Keller   http://www.hlrs.de/people/keller
 HLRS                      Tel: ++49 (0)711-685 6 5858
 Nobelstrasse 19           Fax: ++49 (0)711-685 6 5832
 70550 Stuttgart           email: kel...@hlrs.de
 Germany                   AIM/Skype: rusraink

"Emails save time, not printing them saves trees!"
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] SDP support for OPEN-MPI

2008-01-14 Thread Jeff Squyres

On Jan 13, 2008, at 8:19 AM, Lenny Verkhovsky wrote:


> What I meant was try to open an SDP socket.  If it fails because SDP
> is not supported / available to that peer, then open a regular
> socket.  So you should still always have only 1 socket open to a peer
> (not 2).
Yes, but since the listener side doesn't know on which socket to expect
a message, it will need both sockets to be opened.



Ah, you meant the listener socket -- not 2 sockets to each peer.  Ok,  
fair enough.  Opening up one more listener socket in each process is  
no big deal (IMO).


> > If one of the machines does not support SDP, the user will get an error.
>
> Well, that's one way to go, but it's certainly less friendly.  It
> means that the entire MPI job has to support SDP -- including mpirun.
> What about clusters that do not have IB on the head node?
>
They can use OOB over IP sockets and BTL on SDP, it should work.



Yes, I'm fine with this -- IIRC, my point was that if SDP is not  
available (and the user didn't explicitly ask for it), then it should  
not be an error.
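
The "try SDP, fall back to a regular socket unless the user explicitly asked for SDP" behavior described here can be sketched as below. This is an illustration in Python, not OMPI's code: the value 27 is just the number mentioned in this thread (it is not standardized), and open_stream_socket, require_sdp and factory are hypothetical names.

```python
import socket

# Assumption: 27 is the AF_INET_SDP value quoted in this thread.
# It is not standardized and may differ between systems.
AF_INET_SDP = 27


def open_stream_socket(sdp_family=AF_INET_SDP, require_sdp=False,
                       factory=socket.socket):
    """Try an SDP stream socket first; fall back to plain IPv4 TCP.

    Returns (sock, used_sdp).  A failure to create the SDP socket is an
    error only when the caller explicitly asked for SDP.
    The factory argument exists so the logic can be exercised without
    real sockets.
    """
    try:
        return factory(sdp_family, socket.SOCK_STREAM), True
    except OSError:
        if require_sdp:
            raise
    # SDP unavailable and not explicitly requested: use a regular socket.
    return factory(socket.AF_INET, socket.SOCK_STREAM), False
```

Only one socket per peer is ever returned, so the fallback does not double the number of open connections.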


> >> Perhaps a more general approach would be to [perhaps additionally]
> >> provide an MCA param to allow the user to specify the AF_* value?
> >> (AF_INET_SDP is a standardized value, right?  I.e., will it be the
> >> same on all Linux variants [and someday Solaris]?)
> > I didn't find any standard on it, it seems to be "randomly" selected
> > since originally it was 26 and changed to 27 due to a conflict with
> > the kernel's defines.
>
> This might make an even stronger case for having an MCA param for it
> -- if the AF_INET_SDP value is so broken that it's effectively random,
> it may be necessary to override it on some platforms (especially in
> light of binary OMPI and OFED distributions that may not match).
>
If we are talking about passing the AF_INET_SDP value only, then:
1. Passing this value as an MCA parameter will not make any changes to the
SDP code.
2. Hopefully in the future the AF_INET_SDP value can be obtained from
libc, and the value will be configured automatically.
3. If we are talking about the AF_INET value in general (IPv4, IPv6, SDP),
then by making it constant with an MCA parameter we are limiting ourselves
to one protocol only, without being able to fail over or use different
protocols for different needs (i.e., SDP for OOB and IPv4 for BTL).



I'm not sure what you mean.  The AF_INET values for v4 and v6 are  
constantly compiled into OMPI via whatever values they are in the  
system header files.  They're standardized values, right?


My understanding of what you were saying was that AF_INET_SDP is *not*  
standardized such that it may actually be different values on  
different systems.  Hence, an MPI app could be otherwise portable but  
have a wrong value for AF_INET_SDP compiled into its code.


Are you saying something else?
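
To make the portability concern concrete: a run-time override lets a binary built with one AF_INET_SDP value still work on a system that uses another. The sketch below is Python for illustration only. It relies on the OMPI convention that MCA parameters can be set through OMPI_MCA_<name> environment variables, but the parameter name sdp_af is hypothetical, as is the helper sdp_address_family.

```python
import os

# Compiled-in default; 27 is the value quoted in this thread and is
# not standardized.
DEFAULT_AF_INET_SDP = 27


def sdp_address_family(environ=None, param="OMPI_MCA_sdp_af"):
    """Return the address family to use for SDP sockets.

    Hypothetical parameter name; the default applies when it is unset.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(param)
    if raw is None:
        return DEFAULT_AF_INET_SDP
    try:
        value = int(raw, 0)  # accepts decimal or 0x-prefixed hex
    except ValueError:
        raise ValueError(f"{param} must be an integer, got {raw!r}")
    if value <= 0:
        raise ValueError(f"{param} must be positive, got {value}")
    return value
```

Note that this only overrides the SDP value; it does not pin every connection to one protocol, which is the separate concern raised in point 3 above.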


> >> Patrick's got a good point: is there a reason not to do this?
> >> (LD_PRELOAD and the like)  Is it problematic with the remote orted's?
> > Yes, it's problematic with remote orted's and it's not really as
> > transparent as you might think.
> > Since we can't pass environment variables to the orted's during
> > runtime
>
> I think this depends on your environment.  If you're not using rsh
> (which you shouldn't be for a large cluster, which is where SDP would
> matter most, right?), the resource manager typically copies the
> environment out to the cluster nodes.  So an LD_PRELOAD value should
> be set for the orteds as well.
>
> I agree that it's problematic for rsh, but that might also be solvable
> (with some limits; there's only so many characters that we can pass on
> the command line -- we did investigate having a wrapper to the orted
> at one point to accept environment variables and then launch the
> orted, but this was so problematic / klunky that we abandoned the
> idea).
>
Using LD_PRELOAD will not allow us to use SDP and IP separately, i.e.
SDP for OOB and IP for a BTL.



Why would you want to do that?  I would think that the biggest win  
here would be SDP for OOB -- the heck with the BTL.  The BTL was just  
done for completeness (right?); if you have OpenFabrics support, you  
should be using the verbs BTL.


Perhaps I don't understand exactly what you are proposing.  I was  
under the impression that you were going after a common case: mpirun  
and the MPI jobs are running on back-end compute nodes where all of  
them support SDP (although the other case of mpirun running on the  
head node without SDP and all the MPI processes are running on back-end
nodes with SDP is also not-uncommon...).  Are you thinking of
something else, or are you looking for more flexibility?


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] vt integration -- still problems on os x

2008-01-14 Thread Jeff Squyres
Truly weird; I am now unable to reproduce the problem as well.  Can  
you think of any dumb user-level error that I could have done to  
create this problem?  It was very repeatable the other day.  :-(


Oh well.  Barring any objections from Sun, I say that this stuff  
should *finally* be merged back up to the trunk (you guys have the  
patience of saints -- many thanks for all your work! :-) ).


One last tidbit that would be nice to have fixed, though, is to set  
svn:ignore throughout the vt tree to properly ignore files so that an  
"svn status" doesn't turn up a bunch of "?" files that really should  
be ignored by SVN:


[8:09] beezle:~/svn/vt-integration/ompi/contrib/vt % svn st | egrep '^\?' | wc
      90     180    3149
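
For reference, the pipeline above counts the "svn status" lines whose first column is "?", i.e. files that are neither versioned nor ignored. A small Python equivalent of that count, operating on captured status output, might look like this (count_unversioned is a hypothetical helper name):

```python
def count_unversioned(status_output):
    """Count 'svn status' lines whose first column is '?'.

    These are the files that svn:ignore would need to cover for a clean
    'svn status'.
    """
    return sum(1 for line in status_output.splitlines()
               if line.startswith("?"))
```

Once svn:ignore is set throughout the tree, this count should drop to zero for a freshly built working copy.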



On Jan 14, 2008, at 4:43 AM, Matthias Jurenz wrote:


Hi Jeff,

Unfortunately, I need some more information for this problem as well,
because I could not reproduce this error on our Leopard...
Please add the option '-vt:verbose' to the compile command so that I can
see what VT's compiler wrapper does. Furthermore, could you send me the
source file hello.c?


Thanks, Matthias

On Fr, 2008-01-11 at 13:18 -0500, Jeff Squyres wrote:


I am able to compile now on OS X -- great!

However, I seem to get some weird errors when running on Leopard:

[13:14] beezle:~/tmp/foo % mpicc-vt ../hello.c -o hello
[13:14] beezle:~/tmp/foo % nm hello > hello.nm
[13:14] beezle:~/tmp/foo % setenv VT_NMFILE ~/tmp/foo/hello.nm
[13:14] beezle:~/tmp/foo % mpirun -np 4 hello
Hello, world!
Hello, world!
Hello, world!
vtunify: Error: Could not open file ./a.1.uctl
Hello, world!

That's a weird one -- here's what the dir looks like:

[13:14] beezle:~/tmp/foo % ls -l
total 352
drwxrwxr-x   7 jsquyres  staff     238 Jan 11 13:14 ./
drwxrwxr-x  41 jsquyres  staff    1394 Jan 11 13:14 ../
-rw-rw-r--   1 jsquyres  staff    1601 Jan 11 13:14 a.0.def.z
-rw-rw-r--   1 jsquyres  staff      26 Jan 11 13:14 a.1.events.z
-rw-rw-r--   1 jsquyres  staff       4 Jan 11 13:14 a.otf
-rwxrwxr-x   1 jsquyres  staff  150336 Jan 11 13:14 hello*
-rw-rw-r--   1 jsquyres  staff   13266 Jan 11 13:14 hello.nm

Just for the heckuvit, let's try running again...

[13:14] beezle:~/tmp/foo % mpirun -np 4 hello
Hello, world!
Hello, world!
Hello, world!
Hello, world!
Assertion failed: (p_vecLocDefs->size() > 0), function createGlobal,
file vt_unify_defs.cc, line 508.
vtunify: Error: Could not open file ./a.1.uctl
[13:14] beezle:~/tmp/foo %

Yoinks -- an assertion failure...

Successive runs seem to be variations on these errors (the assertion
failure and various "could not open" and "could not remove" errors).



On Jan 11, 2008, at 11:45 AM, Matthias Jurenz wrote:

> Hi Jeff,
>
> I could reproduce the linker problem with the sf.net GCC. Thanks for
> your hint.
> A header include was missing for STL's functional objects. :-(
>
>
> Matthias
>
>
> On Do, 2008-01-10 at 13:21 -0500, Jeff Squyres wrote:
>>
>> On Jan 10, 2008, at 10:19 AM, Andreas Knüpfer wrote:
>>
>> > unfortunately, we're unable to reproduce this error. Could you pass
>> > some more
>> > information about your configure command line? This was done with
>> > gcc 4.2 on
>> > mac os X, wasn't it?
>>
>> I'm on Leopard on my MBP with:
>>
>> ./configure --prefix=/Users/jsquyres/bogus --enable-mpi-f90 --without-threads
>>
>> But I might see the problem here -- I just realized/remembered that
>> I'm using the sf.net GCC install (hpc.sf.net).  If I force /usr/bin/gcc
>> (and friends), it seems to work:
>>
>> ./configure --prefix=/Users/jsquyres/bogus CC=/usr/bin/gcc CXX=/usr/bin/g++ --disable-mpi-fortran
>>
>> However, the hpc.sf.net OS X compilers are not uncommon (because they
>> provide fortran compiler support for OS X).  Do you think you'll be
>> able to test with these compilers?
>>
> --
> Matthias Jurenz,
> Center for Information Services and
> High Performance Computing (ZIH), TU Dresden,
> Willersbau A106, Zellescher Weg 12, 01062 Dresden
> phone +49-351-463-31945, fax +49-351-463-37773



--
Matthias Jurenz,
Center for Information Services and
High Performance Computing (ZIH), TU Dresden,
Willersbau A106, Zellescher Weg 12, 01062 Dresden
phone +49-351-463-31945, fax +49-351-463-37773



--
Jeff Squyres
Cisco Systems




Re: [OMPI devel] [PATCH] openib btl: extensable cpc selection enablement

2008-01-14 Thread Gleb Natapov
On Mon, Jan 14, 2008 at 08:15:23AM -0500, Jeff Squyres (jsquyres) wrote:
> Any obj to bringing this stuff to the trunk?  The moden string opt stuff can 
> be done directly on the trunk imo.
Go ahead.

--
Gleb.


[OMPI devel] carto framework

2008-01-14 Thread Sharon Melamed
I added a new framework to Open MPI called carto. You can browse the code
in: http://svn.open-mpi.org/svn/ompi/tmp-public/carto/.

 

There are some explanations about the carto framework in the project
wiki: https://svn.open-mpi.org/trac/ompi/wiki/OnHostTopologyDescription
and you can read the attached doc.

 

The carto framework can't do any damage because no one calls it. So, if
there are no objections, I would like to merge the carto framework to
the trunk.

 

Sharon. 



carto_framework_requirements.pdf
Description: carto_framework_requirements.pdf


Re: [OMPI devel] vt integration -- still problems on os x

2008-01-14 Thread Matthias Jurenz
Hi Jeff,

Unfortunately, I need some more information for this problem as well,
because I could not reproduce this error on our Leopard...
Please add the option '-vt:verbose' to the compile command so that I can
see what VT's compiler wrapper does. Furthermore, could you send me the
source file hello.c?

Thanks, Matthias

On Fr, 2008-01-11 at 13:18 -0500, Jeff Squyres wrote:

> I am able to compile now on OS X -- great!
> 
> However, I seem to get some weird errors when running on Leopard:
> 
> [13:14] beezle:~/tmp/foo % mpicc-vt ../hello.c -o hello
> [13:14] beezle:~/tmp/foo % nm hello > hello.nm
> [13:14] beezle:~/tmp/foo % setenv VT_NMFILE ~/tmp/foo/hello.nm
> [13:14] beezle:~/tmp/foo % mpirun -np 4 hello
> Hello, world!
> Hello, world!
> Hello, world!
> vtunify: Error: Could not open file ./a.1.uctl
> Hello, world!
> 
> That's a weird one -- here's what the dir looks like:
> 
> [13:14] beezle:~/tmp/foo % ls -l
> total 352
> drwxrwxr-x   7 jsquyres  staff     238 Jan 11 13:14 ./
> drwxrwxr-x  41 jsquyres  staff    1394 Jan 11 13:14 ../
> -rw-rw-r--   1 jsquyres  staff    1601 Jan 11 13:14 a.0.def.z
> -rw-rw-r--   1 jsquyres  staff      26 Jan 11 13:14 a.1.events.z
> -rw-rw-r--   1 jsquyres  staff       4 Jan 11 13:14 a.otf
> -rwxrwxr-x   1 jsquyres  staff  150336 Jan 11 13:14 hello*
> -rw-rw-r--   1 jsquyres  staff   13266 Jan 11 13:14 hello.nm
> 
> Just for the heckuvit, let's try running again...
> 
> [13:14] beezle:~/tmp/foo % mpirun -np 4 hello
> Hello, world!
> Hello, world!
> Hello, world!
> Hello, world!
> Assertion failed: (p_vecLocDefs->size() > 0), function createGlobal,  
> file vt_unify_defs.cc, line 508.
> vtunify: Error: Could not open file ./a.1.uctl
> [13:14] beezle:~/tmp/foo %
> 
> Yoinks -- an assertion failure...
> 
> Successive runs seem to be variations on these errors (the assertion
> failure and various "could not open" and "could not remove" errors).
> 
> 
> 
> On Jan 11, 2008, at 11:45 AM, Matthias Jurenz wrote:
> 
> > Hi Jeff,
> >
> > I could reproduce the linker problem with the sf.net GCC. Thanks for  
> > your hint.
> > A header include was missing for STL's functional objects. :-(
> >
> >
> > Matthias
> >
> >
> > On Do, 2008-01-10 at 13:21 -0500, Jeff Squyres wrote:
> >>
> >> On Jan 10, 2008, at 10:19 AM, Andreas Knüpfer wrote:
> >>
> >> > unfortunately, we're unable to reproduce this error. Could you pass
> >> > some more
> >> > information about your configure command line? This was done with
> >> > gcc 4.2 on
> >> > mac os X, wasn't it?
> >>
> >> I'm on Leopard on my MBP with:
> >>
> >> ./configure --prefix=/Users/jsquyres/bogus --enable-mpi-f90 --without-threads
> >>
> >> But I might see the problem here -- I just realized/remembered that
> >> I'm using the sf.net GCC install (hpc.sf.net).  If I force /usr/bin/gcc
> >> (and friends), it seems to work:
> >>
> >> ./configure --prefix=/Users/jsquyres/bogus CC=/usr/bin/gcc CXX=/usr/bin/g++ --disable-mpi-fortran
> >>
> >> However, the hpc.sf.net OS X compilers are not uncommon (because they
> >> provide fortran compiler support for OS X).  Do you think you'll be
> >> able to test with these compilers?
> >>
> > --
> > Matthias Jurenz,
> > Center for Information Services and
> > High Performance Computing (ZIH), TU Dresden,
> > Willersbau A106, Zellescher Weg 12, 01062 Dresden
> > phone +49-351-463-31945, fax +49-351-463-37773
> 
> 

--
Matthias Jurenz,
Center for Information Services and 
High Performance Computing (ZIH), TU Dresden, 
Willersbau A106, Zellescher Weg 12, 01062 Dresden
phone +49-351-463-31945, fax +49-351-463-37773


Re: [OMPI devel] [PATCH] openib btl: extensable cpc selection enablement

2008-01-14 Thread Pavel Shamis (Pasha)

Jon Mason wrote:
>> I have a few machines with ConnectX and I will try to run MTT on Sunday.
>
> Awesome!  I appreciate it.

After fixing the compilation problem in the XRC part of the code, I was able
to run MTT. Most of the tests pass and one test failed:
mpi2c++_dynamics_test. The test passes without XRC, but I also see the test
fail on the trunk. The last revision where it worked is 1.3a1r17085.

strange
Pasha.