Re: [OMPI devel] Integrating the memchecker branch
I have not had a chance to check out the tmp branch for this (I'm currently in an airport without network access), but it all sounds good in principle to me. Forgive me if I've said these things before, but here's what I'd like to see if possible:

- configure output shows whether this stuff is enabled
  - e.g., does it check for the relevant macros in valgrind's header files? (I assume so; I've totally forgotten...) Ensure that these checks are output in configure's stdout
- ompi_info shows whether this stuff is enabled
- obvious user-level configure errors raise errors/abort configure (e.g., --enable-memchecker is specified but --enable-debug is not), or make some obvious assumptions about "what the user meant" (e.g., if --enable-memchecker is specified but --enable-debug is not, then automatically enable --enable-debug and output a message saying so).
- I think we've said ad nauseam that there should be zero performance penalty when this stuff is not enabled, and I'm guessing that this is still true. :-)
- some kind of documentation should be written up about how to use this stuff, perhaps in the FAQ (e.g., pairing it with a valgrind-enabled libibverbs for max benefit, etc.).

If --enable-memchecker is on, --enable-debug should be on as well to make sense.

On Jan 8, 2008, at 3:11 PM, Rainer Keller wrote:

Hello dear all,

WHAT: We would like to integrate the changes on the memchecker-branch to trunk, as planned in the

WHY: The checking offers memory checking for certain User and OMPI-internal errors, like buffer overruns, size mismatches, checks for wrong send/receive buffers.

WHERE: OMPI trunk and v1.3 phase3

WHEN: Integration into Trunk of memchecker branch: 25.1. (although off-by-default, this leaves enough time before Feature Freeze on 8.2.)

TIMEOUT: None

===

The memchecker branch contains checks for memory buffer faults either in the User-Code or in ompi-code itself. It uses the valgrind-API to set/reset buffer validity of the user buffers passed to the MPI-layer. Additionally, ompi-internal datatypes are checked. Both are configurable using the flags:

--enable-memchecker
--with-valgrind=DIR (if needed)

A decent/recent valgrind is needed (for getting and setting VBITS/using the newer macros). The valgrind version is checked; at least version 3.2.0 is required.

The actual checking is done in the MPI-layer. In order not to trap any (correct) access in the BTL, the user buffer is reset to accessible in the PML-layer (currently OB1 -- others won't make much sense?).

The default behaviour is to *NOT* enable memchecker. If it is enabled but valgrind is not running, the cost of the buffer checks is minimal; the cost for each ompi-datastructure (like a datatype or communicator passed) is not.

Further information regarding penalties and performance may be found in:
http://www.open-mpi.org/papers/parco-2007

Comments from the Paris meeting have been integrated. Are there any objections or hints?

With best regards,
Shiqing and Rainer

PS: If --enable-memchecker is on, --enable-debug should be on as well to make sense.

--
Dipl.-Inf. Rainer Keller http://www.hlrs.de/people/keller
HLRS Tel: ++49 (0)711-685 6 5858
Nobelstrasse 19 Fax: ++49 (0)711-685 6 5832
70550 Stuttgart email: kel...@hlrs.de
Germany AIM/Skype:rusraink
"Emails save time, not printing them saves trees!"
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Jeff Squyres
Cisco Systems
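The configure-time behaviour discussed in this message (require valgrind 3.2.0 or newer, and turn on --enable-debug automatically when --enable-memchecker is given alone) can be sketched in a few lines. This is only an illustration in Python, not Open MPI's actual configure logic; the function names `parse_version` and `resolve_memchecker` and the message text are hypothetical.

```python
MIN_VALGRIND = (3, 2, 0)  # minimum valgrind version stated in the proposal


def parse_version(text):
    """Turn a dotted version such as "3.2.0" into a tuple of ints.

    Assumption: purely numeric components; suffixes are not handled.
    """
    parts = [int(p) for p in text.strip().split(".")]
    parts += [0] * (3 - len(parts))  # pad "3.2" to (3, 2, 0)
    return tuple(parts)


def resolve_memchecker(enable_memchecker=False, enable_debug=False,
                       valgrind_version=None):
    """Return (options, messages) after applying the rules from the thread.

    Hypothetical helper: it aborts on an unusable valgrind and otherwise
    auto-enables debug, reporting that decision in `messages`.
    """
    messages = []
    if not enable_memchecker:
        # Off by default: nothing is changed, so nothing can cost anything.
        return {"memchecker": False, "debug": enable_debug}, messages
    if valgrind_version is None:
        raise ValueError("--enable-memchecker given but valgrind was not found")
    if parse_version(valgrind_version) < MIN_VALGRIND:
        raise ValueError("valgrind %s is too old; need at least 3.2.0"
                         % valgrind_version)
    if not enable_debug:
        enable_debug = True
        messages.append("--enable-memchecker implies --enable-debug; enabling it")
    return {"memchecker": True, "debug": enable_debug}, messages
```

Calling `resolve_memchecker(True, False, "3.2.0")` yields both options enabled plus one explanatory message. The sketch picks the "assume what the user meant" variant; aborting configure instead would simply replace the `messages.append` with another `raise`.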
Re: [OMPI devel] SDP support for OPEN-MPI
On Jan 13, 2008, at 8:19 AM, Lenny Verkhovsky wrote:

> What I meant was try to open an SDP socket. If it fails because SDP is not supported / available to that peer, then open a regular socket. So you should still always have only 1 socket open to a peer (not 2).

Yes, but since the listener side doesn't know on which socket to expect a message it will need both sockets to be opened.

Ah, you meant the listener socket -- not 2 sockets to each peer. Ok, fair enough. Opening up one more listener socket in each process is no big deal (IMO).

> > If one of the machines does not support SDP, the user will get an error.
>
> Well, that's one way to go, but it's certainly less friendly. It means that the entire MPI job has to support SDP -- including mpirun. What about clusters that do not have IB on the head node?

They can use OOB over IP sockets and BTL on SDP, it should work.

Yes, I'm fine with this -- IIRC, my point was that if SDP is not available (and the user didn't explicitly ask for it), then it should not be an error.

> >> Perhaps a more general approach would be to [perhaps additionally] provide an MCA param to allow the user to specify the AF_* value? (AF_INET_SDP is a standardized value, right? I.e., will it be the same on all Linux variants [and someday Solaris]?)
> > I didn't find any standard on it, it seems to be "randomly" selected since originally it was 26 and changed to 27 due to conflict with kernel's defines.
>
> This might make an even stronger case for having an MCA param for it -- if the AF_INET_SDP value is so broken that it's effectively random, it may be necessary to override it on some platforms (especially in light of binary OMPI and OFED distributions that may not match).

If we are talking about passing the AF_INET_SDP value only, then:
1. Passing this value as an mca parameter will not make any changes to the SDP code.
2. Hopefully in the future the AF_INET_SDP value can be gotten from the libc, and the value will be configured automatically.
3. If we are talking about the AF_INET value in general (IPv4, IPv6, SDP), then by making it constant with an mca parameter we are limiting ourselves to one protocol only, without being able to fail over or use different protocols for different needs (i.e. SDP for OOB and IPv4 for BTL).

I'm not sure what you mean. The AF_INET values for v4 and v6 are compiled into OMPI as constants, using whatever values the system header files define. They're standardized values, right? My understanding of what you were saying was that AF_INET_SDP is *not* standardized such that it may actually be different values on different systems. Hence, an MPI app could be otherwise portable but have a wrong value for AF_INET_SDP compiled into its code. Are you saying something else?

> >> Patrick's got a good point: is there a reason not to do this? (LD_PRELOAD and the like) Is it problematic with the remote orted's?
> > Yes, it's problematic with remote orted's and it is not really as transparent as you might think. Since we can't pass environment variables to the orted's during runtime
>
> I think this depends on your environment. If you're not using rsh (which you shouldn't be for a large cluster, which is where SDP would matter most, right?), the resource manager typically copies the environment out to the cluster nodes. So an LD_PRELOAD value should be set for the orteds as well.
>
> I agree that it's problematic for rsh, but that might also be solvable (with some limits; there's only so many characters that we can pass on the command line -- we did investigate having a wrapper to the orted at one point to accept environment variables and then launch the orted, but this was so problematic / klunky that we abandoned the idea).

Using LD_PRELOAD will not allow us to use SDP and IP separately, i.e. SDP for OOB and IP for a BTL.

Why would you want to do that? I would think that the biggest win here would be SDP for OOB -- the heck with the BTL. The BTL was just done for completeness (right?); if you have OpenFabrics support, you should be using the verbs BTL.

Perhaps I don't understand exactly what you are proposing. I was under the impression that you were going after a common case: mpirun and the MPI jobs are running on back-end compute nodes where all of them support SDP (although the other case of mpirun running on the head node without SDP and all the MPI processes running on back-end nodes with SDP is also not uncommon...). Are you thinking of something else, or are you looking for more flexibility?

--
Jeff Squyres
Cisco Systems
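The fallback discussed in this message (try an SDP socket first, use a regular socket if SDP is unavailable, and treat that as an error only when the user explicitly asked for SDP) might look roughly like the following Python sketch. Assumptions: the value 27 for AF_INET_SDP is taken from this thread and is not a standardized constant; the environment variable EXAMPLE_AF_INET_SDP merely stands in for the proposed MCA parameter and is hypothetical, as are the function names. This is not Open MPI code.

```python
import os
import socket

# Assumption: 27 is the AF_INET_SDP value mentioned in the thread (it was 26
# earlier), so it is kept overridable rather than hard-wired.
DEFAULT_AF_INET_SDP = 27


def sdp_family(environ=None):
    """Return the address family to try for SDP, honoring an override.

    EXAMPLE_AF_INET_SDP is a hypothetical stand-in for an MCA parameter.
    """
    if environ is None:
        environ = os.environ
    return int(environ.get("EXAMPLE_AF_INET_SDP", DEFAULT_AF_INET_SDP))


def open_stream_socket(sdp_requested=False, family_sdp=None,
                       factory=socket.socket):
    """Try SDP first, then fall back to a regular IPv4 stream socket.

    Returns (sock, family_used). If the user explicitly requested SDP and it
    is unavailable, the OSError propagates instead of silently falling back.
    `factory` is injectable so the logic can be exercised without SDP support.
    """
    if family_sdp is None:
        family_sdp = sdp_family()
    try:
        return factory(family_sdp, socket.SOCK_STREAM), family_sdp
    except OSError:
        if sdp_requested:
            raise  # the user asked for SDP, so this is a real error
        return factory(socket.AF_INET, socket.SOCK_STREAM), socket.AF_INET
```

Because the family is chosen per call, one caller could request SDP for OOB while another keeps plain IP for a BTL, which is the flexibility an LD_PRELOAD approach would not give.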
Re: [OMPI devel] vt integration -- still problems on os x
Truly weird; I am now unable to reproduce the problem as well. Can you think of any dumb user-level error that I could have done to create this problem? It was very repeatable the other day. :-(

Oh well. Barring any objections from Sun, I say that this stuff should *finally* be merged back up to the trunk (you guys have the patience of saints -- many thanks for all your work! :-) ).

One last tidbit that would be nice to have fixed, though, is to set svn:ignore throughout the vt tree to properly ignore files so that an "svn status" doesn't turn up a bunch of "?" files that really should be ignored by SVN:

[8:09] beezle:~/svn/vt-integration/ompi/contrib/vt % svn st | egrep '^ \?' | wc
      90     180    3149

On Jan 14, 2008, at 4:43 AM, Matthias Jurenz wrote:

Hi Jeff,

unfortunately, also for this problem I need some more information, because I could not reproduce this error on our Leopard... Please add the option '-vt:verbose' to the compile command so that I can see what VT's compiler wrapper does. Furthermore, could you send me the source file hello.c?

Thanks,
Matthias

On Fri, 2008-01-11 at 13:18 -0500, Jeff Squyres wrote:

I am able to compile now on OS X -- great!

However, I seem to get some weird errors when running on Leopard:

[13:14] beezle:~/tmp/foo % mpicc-vt ../hello.c -o hello
[13:14] beezle:~/tmp/foo % nm hello > hello.nm
[13:14] beezle:~/tmp/foo % setenv VT_NMFILE ~/tmp/foo/hello.nm
[13:14] beezle:~/tmp/foo % mpirun -np 4 hello
Hello, world!
Hello, world!
Hello, world!
vtunify: Error: Could not open file ./a.1.uctl
Hello, world!

That's a weird one -- here's what the dir looks like:

[13:14] beezle:~/tmp/foo % ls -l
total 352
drwxrwxr-x   7 jsquyres  staff     238 Jan 11 13:14 ./
drwxrwxr-x  41 jsquyres  staff    1394 Jan 11 13:14 ../
-rw-rw-r--   1 jsquyres  staff    1601 Jan 11 13:14 a.0.def.z
-rw-rw-r--   1 jsquyres  staff      26 Jan 11 13:14 a.1.events.z
-rw-rw-r--   1 jsquyres  staff       4 Jan 11 13:14 a.otf
-rwxrwxr-x   1 jsquyres  staff  150336 Jan 11 13:14 hello*
-rw-rw-r--   1 jsquyres  staff   13266 Jan 11 13:14 hello.nm

Just for the heckuvit, let's try running again...

[13:14] beezle:~/tmp/foo % mpirun -np 4 hello
Hello, world!
Hello, world!
Hello, world!
Hello, world!
Assertion failed: (p_vecLocDefs->size() > 0), function createGlobal, file vt_unify_defs.cc, line 508.
vtunify: Error: Could not open file ./a.1.uctl
[13:14] beezle:~/tmp/foo %

Yoinks -- an assertion failure...

Successive runs seem to be variations on these errors (the assertion failure and various "could not open" and "could not remove" errors).

On Jan 11, 2008, at 11:45 AM, Matthias Jurenz wrote:

> Hi Jeff,
>
> I could reproduce the linker problem with the sf.net GCC. Thanks for your hint.
> A header include was missing for STL's functional objects. :-(
>
> Matthias
>
> On Thu, 2008-01-10 at 13:21 -0500, Jeff Squyres wrote:
>>
>> On Jan 10, 2008, at 10:19 AM, Andreas Knüpfer wrote:
>>
>> > unfortunately, we're unable to reproduce this error. Could you pass some more information about your configure command line? This was done with gcc 4.2 on mac os X, wasn't it?
>>
>> I'm on Leopard on my MBP with:
>>
>> ./configure --prefix=/Users/jsquyres/bogus --enable-mpi-f90 --without-threads
>>
>> But I might see the problem here -- I just realized/remembered that I'm using the sf.net GCC install (hpc.sf.net). If I force /usr/bin/gcc (and friends), it seems to work:
>>
>> ./configure --prefix=/Users/jsquyres/bogus CC=/usr/bin/gcc CXX=/usr/bin/g++ --disable-mpi-fortran
>>
>> However, the hpc.sf.net OS X compilers are not uncommon (because they provide fortran compiler support for OS X). Do you think you'll be able to test with these compilers?
>>
> --
> Matthias Jurenz,
> Center for Information Services and
> High Performance Computing (ZIH), TU Dresden,
> Willersbau A106, Zellescher Weg 12, 01062 Dresden
> phone +49-351-463-31945, fax +49-351-463-37773

--
Matthias Jurenz,
Center for Information Services and
High Performance Computing (ZIH), TU Dresden,
Willersbau A106, Zellescher Weg 12, 01062 Dresden
phone +49-351-463-31945, fax +49-351-463-37773

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] [PATCH] openib btl: extensable cpc selection enablement
On Mon, Jan 14, 2008 at 08:15:23AM -0500, Jeff Squyres (jsquyres) wrote:
> Any obj to bringing this stuff to the trunk? The moden string opt stuff can be done directly on the trunk imo.

Go ahead.

--
Gleb.
[OMPI devel] carto framework
I added a new framework to Open MPI called carto. You can browse the code at:
http://svn.open-mpi.org/svn/ompi/tmp-public/carto/

There are some explanations about the carto framework in the project wiki:
https://svn.open-mpi.org/trac/ompi/wiki/OnHostTopologyDescription
and you can read the attached doc.

The carto framework can't do any damage because nothing calls it yet. So, if there are no objections, I would like to merge the carto framework into the trunk.

Sharon.

Attachment: carto_framework_requirements.pdf
Re: [OMPI devel] vt integration -- still problems on os x
Hi Jeff,

unfortunately, also for this problem I need some more information, because I could not reproduce this error on our Leopard... Please add the option '-vt:verbose' to the compile command so that I can see what VT's compiler wrapper does. Furthermore, could you send me the source file hello.c?

Thanks,
Matthias

On Fri, 2008-01-11 at 13:18 -0500, Jeff Squyres wrote:
> I am able to compile now on OS X -- great!
>
> However, I seem to get some weird errors when running on Leopard:
>
> [13:14] beezle:~/tmp/foo % mpicc-vt ../hello.c -o hello
> [13:14] beezle:~/tmp/foo % nm hello > hello.nm
> [13:14] beezle:~/tmp/foo % setenv VT_NMFILE ~/tmp/foo/hello.nm
> [13:14] beezle:~/tmp/foo % mpirun -np 4 hello
> Hello, world!
> Hello, world!
> Hello, world!
> vtunify: Error: Could not open file ./a.1.uctl
> Hello, world!
>
> That's a weird one -- here's what the dir looks like:
>
> [13:14] beezle:~/tmp/foo % ls -l
> total 352
> drwxrwxr-x   7 jsquyres  staff     238 Jan 11 13:14 ./
> drwxrwxr-x  41 jsquyres  staff    1394 Jan 11 13:14 ../
> -rw-rw-r--   1 jsquyres  staff    1601 Jan 11 13:14 a.0.def.z
> -rw-rw-r--   1 jsquyres  staff      26 Jan 11 13:14 a.1.events.z
> -rw-rw-r--   1 jsquyres  staff       4 Jan 11 13:14 a.otf
> -rwxrwxr-x   1 jsquyres  staff  150336 Jan 11 13:14 hello*
> -rw-rw-r--   1 jsquyres  staff   13266 Jan 11 13:14 hello.nm
>
> Just for the heckuvit, let's try running again...
>
> [13:14] beezle:~/tmp/foo % mpirun -np 4 hello
> Hello, world!
> Hello, world!
> Hello, world!
> Hello, world!
> Assertion failed: (p_vecLocDefs->size() > 0), function createGlobal, file vt_unify_defs.cc, line 508.
> vtunify: Error: Could not open file ./a.1.uctl
> [13:14] beezle:~/tmp/foo %
>
> Yoinks -- an assertion failure...
>
> Successive runs seem to be variations on these errors (the assertion failure and various "could not open" and "could not remove" errors).
>
> On Jan 11, 2008, at 11:45 AM, Matthias Jurenz wrote:
>
> > Hi Jeff,
> >
> > I could reproduce the linker problem with the sf.net GCC. Thanks for your hint.
> > A header include was missing for STL's functional objects. :-(
> >
> > Matthias
> >
> > On Thu, 2008-01-10 at 13:21 -0500, Jeff Squyres wrote:
> >>
> >> On Jan 10, 2008, at 10:19 AM, Andreas Knüpfer wrote:
> >>
> >> > unfortunately, we're unable to reproduce this error. Could you pass some more information about your configure command line? This was done with gcc 4.2 on mac os X, wasn't it?
> >>
> >> I'm on Leopard on my MBP with:
> >>
> >> ./configure --prefix=/Users/jsquyres/bogus --enable-mpi-f90 --without-threads
> >>
> >> But I might see the problem here -- I just realized/remembered that I'm using the sf.net GCC install (hpc.sf.net). If I force /usr/bin/gcc (and friends), it seems to work:
> >>
> >> ./configure --prefix=/Users/jsquyres/bogus CC=/usr/bin/gcc CXX=/usr/bin/g++ --disable-mpi-fortran
> >>
> >> However, the hpc.sf.net OS X compilers are not uncommon (because they provide fortran compiler support for OS X). Do you think you'll be able to test with these compilers?
> >>
> > --
> > Matthias Jurenz,
> > Center for Information Services and
> > High Performance Computing (ZIH), TU Dresden,
> > Willersbau A106, Zellescher Weg 12, 01062 Dresden
> > phone +49-351-463-31945, fax +49-351-463-37773

--
Matthias Jurenz,
Center for Information Services and
High Performance Computing (ZIH), TU Dresden,
Willersbau A106, Zellescher Weg 12, 01062 Dresden
phone +49-351-463-31945, fax +49-351-463-37773
Re: [OMPI devel] [PATCH] openib btl: extensable cpc selection enablement
Jon Mason wrote:

I have a few machines with ConnectX and I will try to run MTT on Sunday.

Awesome! I appreciate it.

After fixing the compilation problem in the XRC part of the code I was able to run MTT. Most of the tests pass and one test failed: mpi2c++_dynamics_test. The test passes without XRC, but I also see it fail on the trunk. The last revision where it worked is 1.3a1r17085. Strange.

Pasha.