Re: [OMPI users] Fortran vs C reductions
Gilles Gouaillardet writes:

>> implementation. Must I compile in support for being called with
>> MPI_DOUBLE_COMPLEX?
>
> does that really matter ?

Possibly. For example, if the library needed to define some static data, its setup might involve communicating values before being called with that particular type. That setup phase would fail if the Fortran type is invalid.

> i assume your library and the user code are built with the same OpenMPI.
> if there is no Fortran support, then you are compiling code that cannot
> be invoked (e.g. dead code),
> and though that is not the most elegant thing to do, that does not sound
> like a showstopper to me.
>
> so yes, compile in support for being called with Fortran predefined
> datatypes,
> worst case scenario is you generate broken dead code.

No, the worst case is that the library crashes at run time, e.g., during setup of some sort. I don't have a specific library with this behavior, but I can fill in the details to justify such a thing.

Anyway, my suggestion is to either make this a compile-time error, so that a configure script can test for validity, or make it possible to query at run time whether the type/object is valid. The latter would have the advantage that you could rebuild MPI to add Fortran support and dependent projects would not need to be rebuilt, because they would see the same environment. I think that would involve new function(s).
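To make the run-time-query suggestion concrete, it might look something like the sketch below. MPIX_Type_valid does not exist in MPI or Open MPI; the name and signature are purely hypothetical, shown only to illustrate the proposal (as written, this would not link against any current MPI).

#include <mpi.h>

/* Hypothetical query -- no such function exists today.  It would report
 * whether a predefined handle is backed by a working implementation in
 * this particular build. */
int MPIX_Type_valid(MPI_Datatype type, int *flag);

void setup_reduction_support(void)
{
  int have_dblcplx = 0;
  MPIX_Type_valid(MPI_DOUBLE_COMPLEX, &have_dblcplx);
  if (have_dblcplx) {
    /* register the complex kernels */
  } else {
    /* skip them; callers passing Fortran types get a clean error */
  }
}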
Re: [OMPI users] Fortran vs C reductions
Dave Love writes:

> Jed Brown writes:
>
>> Isn't that entirely dependent on the Fortran compiler? There is no
>> universal requirement that there be a relationship between Fortran
>> INTEGER and C int, for example.
>
> In case it's not generally obvious: the compiler _and its options_.
> You can typically change the width of real and double precision, as with
> gfortran -fdefault-real-8, and similarly for integer. (It seems unKIND
> if the MPI standard specifically enshrines double precision, but
> anyhow...)

Indeed. (Though such options are an abomination.)

>> Feature tests are far more reliable, accurate, and lower maintenance
>> than platform/version tests. When a package defines macros/symbols that
>> fail at run-time, it makes feature tests much more expensive. Even more
>> so when cross-compiling, where run-time tests require batch submission.
>
> Right, but isn't the existence of the compiler wrapper the appropriate
> test for Fortran support, and don't you really need it to run
> Fortran-related feature tests?

The wrapper might not exist. It doesn't on many prominent platforms today.

> I have an "integer*8" build of OMPI, for instance. It's a pain
> generally when build systems for MPI stuff avoid compiler wrappers,
> and I'd hope that using them could make possibly-unfortunate standards
> requirements like this moot. Would there be a problem with that in
> this case?

In the example of my other reply, my library would not need to call MPI from Fortran, but it needs to know whether to compile support for decoding Fortran datatypes.

My personal opinion is that compiler wrappers are gross (they don't compose), but systems like CMake that insist on circumventing compiler wrappers in a yet more error-prone way are worse.
Re: [OMPI users] Fortran vs C reductions
Gilles Gouaillardet writes:

> Jed,
>
> my 0.02US$
>
> we recently had a kind of similar discussion about MPI_DATATYPE_NULL, and
> we concluded
> ompi should do its best to implement the MPI standard, and not what some of
> us think the standard should be.

Did anyone suggest violating the standard?

> in your configure script, you can simply try to compile a simple fortran
> MPI hello world.
> if it fails, then you can assume fortran bindings are not available, and
> not use fortran types in your application.

With which compiler? Remember that we're talking about the C macros -- the user of those might not have any Fortran in their code.

Suppose, for example, I have a C library that implements a custom reduction. I'll need to check the datatype to dispatch to a concrete implementation. Must I compile in support for being called with MPI_DOUBLE_COMPLEX?
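A minimal sketch of the situation described above: a C-only user-defined reduction whose kernel is selected by inspecting the datatype. The MaxAbs name and the two supported types are illustrative, not from any real library.

#include <mpi.h>
#include <math.h>
#include <complex.h>

/* User-defined reduction: a concrete kernel is chosen from *dtype.  If
 * callers may pass MPI_DOUBLE_COMPLEX, that branch must be compiled in
 * even though the library itself contains no Fortran. */
static void MaxAbs(void *in, void *inout, int *len, MPI_Datatype *dtype)
{
  int i;
  if (*dtype == MPI_DOUBLE) {
    const double *x = in;
    double *y = inout;
    for (i = 0; i < *len; i++) if (fabs(x[i]) > fabs(y[i])) y[i] = x[i];
  } else if (*dtype == MPI_DOUBLE_COMPLEX) { /* Fortran predefined type */
    const double complex *x = in;
    double complex *y = inout;
    for (i = 0; i < *len; i++) if (cabs(x[i]) > cabs(y[i])) y[i] = x[i];
  }
}

/* Usage: MPI_Op op; MPI_Op_create(MaxAbs, 1, &op); */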
Re: [OMPI users] Fortran vs C reductions
George Bosilca writes:

> Now we can argue if DOUBLE PRECISION in Fortran is a double in C. As these
> languages are interoperable, and there is no explicit conversion function,
> it is safe to assume this is the case. Thus, it seems to me absolutely
> legal to provide the MPI-required support for DOUBLE PRECISION despite the
> fact that Fortran support is not enabled.

Isn't that entirely dependent on the Fortran compiler? There is no universal requirement that there be a relationship between Fortran INTEGER and C int, for example.

> Now taking a closer look at the op, I see nothing in the standard that would
> require providing the op if the corresponding language is not supported.
> While it could be nice (as a convenience for the users and also because
> there is no technical reason not to) to enable the loc op on non-native
> datatypes, this is not mandatory. Thus, the current behavior exposed by
> Open MPI is acceptable from the standard perspective.

I believe the question is not whether it's standard-compliant to define the types when they are not supported (the OP's usage doesn't sound valid anyway because they are using the Fortran MPI datatypes to refer to C types). Rather, the question is: if those types are non-functional, can/should they be removed from the header? That, for example, would allow a configure script to test whether those datatypes exist.

Feature tests are far more reliable, accurate, and lower maintenance than platform/version tests. When a package defines macros/symbols that fail at run time, it makes feature tests much more expensive. Even more so when cross-compiling, where run-time tests require batch submission. The fact is that if a package makes it impractical to test features, the end-user experience reflects poorly on that package and all of its dependencies (through which user support passes). It's the sort of thing that drives users and developers away from the platform.

Since I don't think you can make the Fortran types reliable without access to a Fortran compiler, my suggestion would be to remove the symbols when Fortran is not available.
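If the symbols were removed when unsupported, a configure script could use an ordinary compile-and-link test like the following sketch. The program only needs to build, not run, which keeps the test cross-compile friendly.

/* conftest.c: compiles and links only if MPI_DOUBLE_COMPLEX exists in
 * this MPI installation. */
#include <mpi.h>

int main(void)
{
  MPI_Datatype t = MPI_DOUBLE_COMPLEX;
  return t == MPI_DATATYPE_NULL;  /* reference it so it must resolve */
}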
Re: [OMPI users] Open MPI MPI-OpenMP Hybrid Binding Question
Dave Love writes:

> PETSc can't be using MPI-3 because I'm in the process of fixing rpm
> packaging for the current version and building it with ompi 1.6.

It would be folly for PETSc to ship with a hard dependency on MPI-3. You wouldn't be able to package it with ompi-1.6, for example. But that doesn't mean PETSc's configure can't test for MPI-3 functionality and use it when available. Indeed, it does (though for different capability than mentioned in this thread).

> (Exascale is only of interest if/when there are spin-offs useful for
> university-scale systems.)

I was hoping for a running example. The relevant example for the technique mentioned in this thread is in src/ksp/ksp/examples/tests/benchmarkscatters of the 'master' versus 'barry/utilize-hwloc' branches. It's completely experimental at this time.
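As a sketch of the optional-use pattern (PETSc's actual detection happens at configure time; this preprocessor guard is just the simplest illustration):

#include <mpi.h>

/* Guard MPI-3 functionality so the same source still builds and runs
 * against an MPI-2 implementation such as Open MPI 1.6. */
void Sync(MPI_Comm comm)
{
#if defined(MPI_VERSION) && MPI_VERSION >= 3
  MPI_Request req;
  MPI_Ibarrier(comm, &req);          /* MPI-3 path */
  MPI_Wait(&req, MPI_STATUS_IGNORE);
#else
  MPI_Barrier(comm);                 /* MPI-2 fallback */
#endif
}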
Re: [OMPI users] [petsc-maint] Deadlock in OpenMPI 1.8.3 and PETSc 3.4.5
"Jeff Squyres (jsquyres)" writes: > This is, unfortunately, an undefined area of the MPI specification. I > do believe that our previous behavior was *correct* -- it just > deadlocks with PETSC because PETSC is relying on undefined behavior. Jeff, can you clarify where in the standard this is left undefined? Is one to assume that callbacks can never call into MPI unless explicitly allowed? Note that empirically, this usage has worked with all implementations since 1998, except this version of Open MPI. If the callback is to be considered invalid, how would you recommend implementing two-way linked communicators? > For those who care, Microsoft/Cisco proposed a new attribute system to > the Forum a while ago that removes all these kinds of ambiguities (see > http://meetings.mpi-forum.org/secretary/2013/09/slides/jsquyres-attributes-revamp.pdf). > However, we didn't get a huge amount of interest, and therefore lost > our window of availability opportunity to be able to advance the > proposal. I'd be more than happy to talk anyone through the proposal > if they have interest/cycles in taking it over and advancing it with > the Forum. > > Two additional points from the PDF listed above: > > - on slide 21, it was decided to no allow the recursive behavior (i.e., you > can ignore the "This is under debate" bullet. > - the "destroy" callback was not judged to be useful; you can ignore slides > 22 and 23. signature.asc Description: PGP signature
Re: [OMPI users] [FEniCS] Question about MPI barriers
Martin Sandve Alnæs writes:

> Thanks, but ibarrier doesn't seem to be in the stable version of openmpi:
> http://www.open-mpi.org/doc/v1.8/
> Otherwise mpi_ibarrier+mpi_test+homemade time/sleep loop would do the trick.

MPI_Ibarrier is there (since 1.7), just missing a man page.
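The "ibarrier + test + sleep" idea mentioned above might look like this sketch (the function name and the 1 ms back-off are arbitrary choices):

#include <mpi.h>
#include <unistd.h>

/* Wait on a barrier without spinning: start a nonblocking barrier and
 * poll it, sleeping between polls. */
void LazyBarrier(MPI_Comm comm)
{
  MPI_Request req;
  int done = 0;
  MPI_Ibarrier(comm, &req);
  while (!done) {
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    if (!done) usleep(1000);   /* arbitrary 1 ms back-off */
  }
}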
Re: [OMPI users] latest stable and win7/msvc2013
Damien writes:

> Visual Studio can link libs compiled with Intel.

The headers also need to fall within the language subset implemented by MSVC, but this is easier to ensure, and the Windows ecosystem seems to be happy with binary distribution.
Re: [OMPI users] latest stable and win7/msvc2013
Ralph Castain writes:

> Yeah, but I'm cheap and get the Intel compilers for free :-)

Fine for you, but not for the people trying to integrate your library in a stack developed using MSVC.
Re: [OMPI users] latest stable and win7/msvc2013
Rob Latham writes:

> hey, (almost all of) c99 support is in place in visual studio 2013
> http://blogs.msdn.com/b/vcblog/archive/2013/07/19/c99-library-support-in-visual-studio-2013.aspx

This talks about the standard library, but not whether the C frontend has acquired these features. Are they attempting to support C99 or merely providing some features to C++ users?

BTW, did Microsoft ever fix the misimplementation of variadic macros? http://stackoverflow.com/questions/5134523/msvc-doesnt-expand-va-args-correctly
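The linked issue is the well-known __VA_ARGS__ re-expansion bug. A small program that exposes the commonly reported symptom (a sketch; COUNT is just an illustrative argument-counting macro):

#include <stdio.h>

#define COUNT_IMPL(_1, _2, _3, N, ...) N
#define COUNT(...) COUNT_IMPL(__VA_ARGS__, 3, 2, 1)

int main(void)
{
  /* A conforming C99 preprocessor prints 3.  Older MSVC passes
   * __VA_ARGS__ on as a single argument, so _1 binds to "1, 2, 3"
   * and this prints 1 instead. */
  printf("%d\n", COUNT(1, 2, 3));
  return 0;
}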
Re: [OMPI users] latest stable and win7/msvc2013
Damien writes:

> Is this something that could be funded by Microsoft, and is it time to
> approach them perhaps? MS MPI is based on MPICH, and if mainline MPICH
> isn't supporting Windows anymore, then there won't be a whole lot of
> development in an increasingly older Windows build. With the Open-MPI
> roadmap, there's a lot happening. Would it be a better business model
> for MS to piggy-back off of Open-MPI ongoing innovation, and put their
> resources into maintaining a Windows build of Open-MPI instead?

Maybe Fab can comment on Microsoft's intentions regarding MPI and C99/C11 (just dreaming now).

> On 2014-07-17 11:42 AM, Jed Brown wrote:
>> Rob Latham writes:
>>> Well, I (and dgoodell and jsquyers and probably a few others of you) can
>>> say from observing disc...@mpich.org traffic that we get one message
>>> about Windows support every month -- probably more often.
>> Seems to average at least once a week. We also see regular petsc
>> support emails wondering why --download-{mpich,openmpi} does not work on
>> Windows. (These options are pretty much only used by beginners for whom
>> PETSc is their first encounter with MPI.)
Re: [OMPI users] latest stable and win7/msvc2013
Rob Latham writes:

> Well, I (and dgoodell and jsquyers and probably a few others of you) can
> say from observing disc...@mpich.org traffic that we get one message
> about Windows support every month -- probably more often.

Seems to average at least once a week. We also see regular petsc support emails wondering why --download-{mpich,openmpi} does not work on Windows. (These options are pretty much only used by beginners for whom PETSc is their first encounter with MPI.)
[OMPI users] CXX=no in config.status, breaks mpic++ wrapper
With ompi-git from Monday (7e023a4ebf1aeaa530f79027d00c1bdc16b215fd), configure is putting "compiler=no" in ompi/tools/wrappers/mpic++-wrapper-data.txt:

# There can be multiple blocks of configuration data, chosen by
# compiler flags (using the compiler_args key to chose which block
# should be activated. This can be useful for multilib builds. See the
# multilib page at:
# https://svn.open-mpi.org/trac/ompi/wiki/compilerwrapper3264
# for more information.
project=Open MPI
project_short=OMPI
version=1.9a1
language=C++
compiler_env=CXX
compiler_flags_env=CXXFLAGS
compiler=no
preprocessor_flags=
compiler_flags_prefix=
compiler_flags=-pthread
linker_flags= -Wl,-rpath -Wl,@{libdir} -Wl,--enable-new-dtags
# Note that per https://svn.open-mpi.org/trac/ompi/ticket/3422, we
# intentionally only link in the MPI libraries (ORTE, OPAL, etc. are
# pulled in implicitly) because we intend MPI applications to only use
# the MPI API.
libs= -lmpi
libs_static= -lmpi -lopen-rte -lopen-pal -lm -lnuma -lpciaccess -ldl
dyn_lib_file=libmpi.so
static_lib_file=libmpi.a
required_file=
includedir=${includedir}
libdir=${libdir}

This breaks the wrapper:

$ /path/to/mpic++
--
The Open MPI wrapper compiler was unable to find the specified compiler
no
in your PATH.

Note that this compiler was either specified at configure time or in
one of several possible environment variables.
--

Attaching logs because it's not obvious to me what is going wrong. Automake-1.14.1 and autoconf-2.69.
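As a possible interim workaround (untested against this particular build): Open MPI's wrappers document an OMPI_CXX environment variable that overrides the compiler recorded in the wrapper data file, which should at least make the wrapper usable until the configure problem is fixed:

$ OMPI_CXX=g++ /path/to/mpic++ -show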
Re: [OMPI users] MPI stats argument in Fortran mpi module
"Jeff Squyres (jsquyres)" writes: >> Totally superficial, just passing "status(1)" instead of "status" or >> "status(1:MPI_STATUS_SIZE)". > > That's a different type (INTEGER scalar vs. INTEGER array). So the > compiler complaining about that is actually correct. Yes, exactly. > Under the covers, Fortran will (most likely) pass both by reference, > so they'll both actually (most likely) *work* if you build with an MPI > that doesn't provide an interface for MPI_Recv, but passing status(1) > is actually incorrect Fortran. Prior to slice notation, this would be the only way to build an array of statuses. I.e., receives go into status(1:MPI_STATUS_SIZE), status(1+MPI_STATUS_SIZE:2*MPI_STATUS_SIZE), etc. Due to pass-by-reference semantics, I think this will always work, despite not type-checking with explicit interfaces. I don't know what the language standard says about backward-compatibility of such constructs, but presumably we need to know the dialect to understand whether it's acceptable. (I actually don't know if the Fortran 77 standard defines the behavior when passing status(1), status(1+MPI_STATUS_SIZE), etc., or whether it works only as a consequence of the only reasonable implementation. > I think you're saying that you agree with my above statements about > the different types, and you're just detailing how you got to asking > about WTF we were providing an MPI_Recv interface in the first place. > Kumbaya. :-) Indeed. pgpRjssQGNBtq.pgp Description: PGP signature
Re: [OMPI users] Regression: Fortran derived types with newer MPI module
"Jeff Squyres (jsquyres)" writes: > As I mentioned Craig and I debated long and hard to change that > default, but, in summary, we apparently missed this clause on p610. > I'll change it back. Okay, thanks. > I'll be happy when gfortran 4.9 is released that supports ignore TKR > and you'll get proper interfaces. :-) Better for everyone. >> I don't call MPI from Fortran, but someone on a Fortran project that I >> watch mentioned that the compiler would complain about such and such a >> use (actually relating to types for MPI_Status in MPI_Recv rather than >> buffer types). > > Can you provide more details here? Choice buffer issues aside, I'm > failing to think of a scenario where you should get a compile mismatch > for the MPI status dummy argument in MPI_Recv... Totally superficial, just passing "status(1)" instead of "status" or "status(1:MPI_STATUS_SIZE)". I extrapolated: how can they provide an explicit interface to MPI_Recv in "use mpi", given portability constraints/existing language standards? pgpCIfMJ5CYnP.pgp Description: PGP signature
Re: [OMPI users] Regression: Fortran derived types with newer MPI module
"Jeff Squyres (jsquyres)" writes: > Yes, I can explain what's going on here. The short version is that a > change was made with the intent to provide maximum Fortran code > safety, but with a possible backwards compatibility issue. If this > change is causing real problems, we can probably change this, but I'd > like a little feedback from the Fortran MPI dev community first. On page 610, I see text disallowing the explicit interfaces in ompi/mpi/fortran/use-mpi-tkr: In S2 and S3: Without such extensions, routines with choice buffers should be provided with an implicit interface, instead of overloading with a different MPI function for each possible buffer type (as mentioned in Section 17.1.11 on page 625). Such overloading would also imply restrictions for passing Fortran derived types as choice buffer, see also Section 17.1.15 on page 629. Why did OMPI decide that this (presumably non-normative) text in the standard was not worth following? (Rejecting something in the standard indicates stronger convictions than would be independently weighing the benefits of each approach.) > c) The design of the MPI-2 "mpi" module has multiple flaws that are > identified in the MPI-3 text (but were not recognized back in MPI-2.x > days). Here's one: until F2008+addendums, there was no Fortran > equivalent of "void *". Hence, the mpi module has to overload > MPI_Send() and have a prototype *for every possible type and > dimension*. And this is not possible, thus the text saying not to do it. I don't call MPI from Fortran, but someone on a Fortran project that I watch mentioned that the compiler would complain about such and such a use (actually relating to types for MPI_Status in MPI_Recv rather than buffer types). My immediate response was "they can't do that because without nonstandard or post-F08 extensions (or exposing the user to c_loc), the type system cannot express those functions and thus you cannot have explicit interfaces". But then I looked at latest OMPI and indeed, it was enumerating types, thus my email. > Here's another fatal flaw: it's not possible for an MPI implementation > to provide MPI_Send() prototypes for user-defined Fortran datatypes. > Hence, the example you cite is a pipe dream for the "mpi" module > because there's no way to specify a (void*)-like argument for the > choice buffer. F2003 has c_loc, which is a sufficient stop-gap until TS 29113 is widely available. I have long-advocated that the best way to write extensible libraries for Fortran2003 callers (even if the library is implemented entirely in Fortran) involves some use of c_loc (e.g., for context arguments). This annoys the Fortran programmers and they usually write perl scripts to generate interfaces that enumerate the types they need and give up on extensibility. ;-) It's nice to know that after 60 years (when Fortran 201x is released, including TS 29113), there will be a Fortran standard with an analogue of void*. > Craig Rasmussen and I debated long and hard about whether to change > the default from "small" to "medium" or not. We finally ended up > doing it with the following rationale: > > - very few codes use the "mpi" module FWIW, I've noticed a few projects transition to it in the last few years. > - but those who do should have the maximum amount of compile-time protection > > ...but we always knew that someone may come complaining some day. And that > day has now come. > > So my question to you / the Fortran MPI dev community is: what do you want > (for gfortran)? 
> > Do you want us to go back to the "small" size by default, or do you > want more compile-time protection by default? (with the obvious > caveat that you can't use user-defined Fortran datatypes as choice > buffers; you might be able to use something like c_loc, but I haven't > thought deeply about this and don't know offhand if that works) I can't answer this as a Fortran developer, but I know that a lot of projects want some modicum of portability and in practice, it takes almost 10 years to flush the old compilers out of production environments. Either the upgrade problem will need to be fixed [1] so that nearly all existing machines have new compilers or Fortran projects will be wrestling with this for a long time yet. Most Fortran packages I know use homogeneous arrays, which also means that they don't call MPI_Type_create_struct or similar functions. If those functions are going to be provided by the module, I think they should be able to use them (e.g., examples in the Standard should work) and the Standard's advice about implicit interfaces should be followed. [1] Also, there are still production machines without MPI-2.0 and I get email if I make a mistake in providing MPI-1 fallback paths. pgp4Mn5eAmbuu.pgp Description: PGP signature
[OMPI users] Regression: Fortran derived types with newer MPI module
The attached code is from the example on page 629-630 (17.1.15 Fortran Derived Types) of MPI-3. This compiles cleanly with MPICH and with OMPI 1.6.5, but not with the latest OMPI. Arrays higher than rank 4 would have a similar problem since they are not enumerated. Did someone decide that a necessarily-incomplete enumeration of types was "good enough" and that other users should use some other workaround?

$ ~/usr/ompi/bin/mpifort -c struct.f90
struct.f90:40.55:
call MPI_SEND(foo, 1, newtype, dest, tag, comm, ierr)
1
Error: There is no specific subroutine for the generic 'mpi_send' at (1)
struct.f90:43.48:
call MPI_GET_ADDRESS(fooarr(1), disp(1), ierr)
1
Error: There is no specific subroutine for the generic 'mpi_get_address' at (1)
struct.f90:44.48:
call MPI_GET_ADDRESS(fooarr(2), disp(2), ierr)
1
Error: There is no specific subroutine for the generic 'mpi_get_address' at (1)
struct.f90:50.61:
call MPI_SEND(fooarr, 5, newarrtype, dest, tag, comm, ierr)
1
Error: There is no specific subroutine for the generic 'mpi_send' at (1)

$ ~/usr/ompi/bin/ompi_info
Package: Open MPI jed@batura Distribution
Open MPI: 1.9a1
Open MPI repo revision: r29531M
Open MPI release date: Oct 26, 2013
Open RTE: 1.9a1
Open RTE repo revision: r29531M
Open RTE release date: Oct 26, 2013
OPAL: 1.9a1
OPAL repo revision: r29531M
OPAL release date: Oct 26, 2013
MPI API: 2.2
Ident string: 1.9a1
Prefix: /home/jed/usr/ompi
Configured architecture: x86_64-unknown-linux-gnu
Configure host: batura
Configured by: jed
Configured on: Mon Jan 6 19:38:01 CST 2014
Configure host: batura
Built by: jed
Built on: Mon Jan 6 19:49:41 CST 2014
Built host: batura
C bindings: yes
C++ bindings: no
Fort mpif.h: yes (all)
Fort use mpi: yes (limited: overloading)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: no
Fort mpi_f08 compliance: The mpi_f08 module was not built
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C compiler family name: GNU
C compiler version: 4.8.2
C++ compiler: g++
C++ compiler absolute: /usr/bin/g++
Fort compiler: /usr/bin/gfortran
Fort compiler abs:
Fort ignore TKR: no
Fort 08 assumed shape: no
Fort optional args: no
Fort BIND(C): no
Fort PRIVATE: no
Fort ABSTRACT: no
Fort ASYNCHRONOUS: no
Fort PROCEDURE: no
Fort f08 using wrappers: yes
C profiling: yes
C++ profiling: no
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: no
C++ exceptions: no
Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
Sparse Groups: no
Internal debug support: yes
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: yes
Heterogeneous support: no
mpirun default --prefix: no
MPI I/O support: yes
MPI_WTIME support: gettimeofday
Symbol vis. support: yes
Host topology support: yes
MPI extensions: FT
Checkpoint support: no (checkpoint thread: no)
C/R Enabled Debugging: no
VampirTrace support: yes
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.9)
MCA compress: bzip (MCA v2.0, API v2.0, Component v1.9)
MCA compress: gzip (MCA v2.0, API v2.0, Component v1.9)
MCA crs: none (MCA v2.0, API v2.0, Component v1.9)
MCA db: hash (MCA v2.0, API v1.0, Component v1.9)
MCA db: print (MCA v2.0, API v1.0, Component v1.9)
MCA event: libevent2021 (MCA v2.0, API v2.0, Component v1.9)
MCA hwloc: external (MCA v2.0, API v2.0, Component v1.9)
MCA if: linux_ipv6 (MCA v2.0, API v2.0, Component v1.9)
MCA if: posix_ipv4 (MCA v2.0, API v2.0, Component v1.9)
MCA installdirs: env (MCA v2.0, API v2.0, Component v1.9)
MCA installdirs: config (MCA v2.0, API v2.0, Component v1.9)
MCA memchecker: valgrind (MCA v2.0, API v2.0, Component v1.9)
Re: [OMPI users] MPI process hangs if OpenMPI is compiled with --enable-thread-multiple
Ralph Castain writes:

> Given that we have no idea what Homebrew uses, I don't know how we
> could clarify/respond.

Pierre provided a link to MacPorts saying that all of the following options were needed to properly enable threads.

  --enable-event-thread-support
  --enable-opal-multi-threads
  --enable-orte-progress-threads
  --enable-mpi-thread-multiple

If that is indeed the case, and if passing some subset of these options results in deadlock, it's not exactly user-friendly. Maybe --enable-mpi-thread-multiple is enough, in which case MacPorts is doing something needlessly complicated and Pierre's link was a red herring?
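Whatever configure flags were used, an application can at least detect at run time whether full thread support actually made it into the build (a minimal sketch):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  if (provided < MPI_THREAD_MULTIPLE)
    fprintf(stderr, "MPI_THREAD_MULTIPLE unavailable (provided=%d); "
                    "refusing to run multi-threaded\n", provided);
  MPI_Finalize();
  return 0;
}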
Re: [OMPI users] MPI process hangs if OpenMPI is compiled with --enable-thread-multiple
Dominique Orban writes:

> My question originates from a hang similar to the one I described in
> my first message in the PETSc tests. They still hang after I corrected
> the OpenMPI compile flags. I'm in touch with the PETSc folks as well
> about this.

Do you have an updated stack trace?
Re: [OMPI users] MPI process hangs if OpenMPI is compiled with --enable-thread-multiple
Pierre Jolivet writes:

> It looks like you are compiling Open MPI with Homebrew. The flags they use in
> the formula when --enable-mpi-thread-multiple is set are wrong.
> c.f. a similar problem with MacPorts
> https://lists.macosforge.org/pipermail/macports-tickets/2013-June/138145.html.

If these "wrong" configure flags cause deadlock, wouldn't you consider it to be an Open MPI bug? In decreasing order of preference, I would say:

1. simple configure flags work to enable the feature
2. configure errors out due to inconsistent flags
3. configure succeeds, but the feature is not actually enabled (so no deadlock, though this is arguably already a bug)
[OMPI users] "C++ compiler absolute"
I built from trunk a couple days ago and notice that mpicxx has an erroneous path:

$ ~/usr/ompi/bin/mpicxx -show
no -I/homes/jedbrown/usr/ompi/include -pthread -Wl,-rpath -Wl,/homes/jedbrown/usr/ompi/lib -Wl,--enable-new-dtags -L/homes/jedbrown/usr/ompi/lib -lmpi

The C compiler is fine:

$ ~/usr/ompi/bin/mpicc -show
/soft/apps/packages/gcc-4.8.0/bin/gcc -I/homes/jedbrown/usr/ompi/include -pthread -Wl,-rpath -Wl,/homes/jedbrown/usr/ompi/lib -Wl,--enable-new-dtags -L/homes/jedbrown/usr/ompi/lib -lmpi

I configured with:

$ ../configure --prefix=/homes/jedbrown/usr/ompi CC=/soft/apps/packages/gcc-4.8.0/bin/gcc CXX=/soft/apps/packages/gcc-4.8.0/bin/g++ FC=/soft/apps/packages/gcc-4.8.0/bin/gfortran

These compilers all exist and the build/install went cleanly. So where does this come from?

C++ compiler absolute: none

$ ~/usr/ompi/bin/ompi_info
Package: Open MPI jedbrown@cg Distribution
Open MPI: 1.9a1
Open MPI repo revision: r28134M
Open MPI release date: Feb 28, 2013
Open RTE: 1.9a1
Open RTE repo revision: r28134M
Open RTE release date: Feb 28, 2013
OPAL: 1.9a1
OPAL repo revision: r28134M
OPAL release date: Feb 28, 2013
MPI API: 2.1
Ident string: 1.9a1
Prefix: /homes/jedbrown/usr/ompi
Configured architecture: x86_64-unknown-linux-gnu
Configure host: cg
Configured by: jedbrown
Configured on: Thu May 23 21:07:25 CDT 2013
Configure host: cg
Built by: jedbrown
Built on: Thu May 23 21:19:23 CDT 2013
Built host: cg
C bindings: yes
C++ bindings: no
Fort mpif.h: yes (all)
Fort use mpi: yes (limited: overloading)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: no
Fort mpi_f08 compliance: The mpi_f08 module was not built
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: /soft/apps/packages/gcc-4.8.0/bin/gcc
C compiler absolute:
C compiler family name: GNU
C compiler version: 4.8.0
C++ compiler: /soft/apps/packages/gcc-4.8.0/bin/g++
C++ compiler absolute: none
Fort compiler: /soft/apps/packages/gcc-4.8.0/bin/gfortran
Fort compiler abs:
Fort ignore TKR: no
Fort 08 assumed shape: no
Fort optional args: no
Fort BIND(C): no
Fort PRIVATE: no
Fort ABSTRACT: no
Fort ASYNCHRONOUS: no
Fort PROCEDURE: no
Fort f08 using wrappers: yes
C profiling: yes
C++ profiling: no
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: no
C++ exceptions: no
Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: no, OMPI progress: no, ORTE progress: no, Event lib: yes)
Sparse Groups: no
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: yes
Heterogeneous support: no
mpirun default --prefix: no
MPI I/O support: yes
MPI_WTIME support: gettimeofday
Symbol vis. support: yes
Host topology support: yes
MPI extensions: FT
Checkpoint support: no (checkpoint thread: no)
C/R Enabled Debugging: no
VampirTrace support: yes
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.9)
MCA compress: bzip (MCA v2.0, API v2.0, Component v1.9)
MCA compress: gzip (MCA v2.0, API v2.0, Component v1.9)
MCA crs: none (MCA v2.0, API v2.0, Component v1.9)
MCA db: hash (MCA v2.0, API v1.0, Component v1.9)
MCA db: print (MCA v2.0, API v1.0, Component v1.9)
MCA event: libevent2019 (MCA v2.0, API v2.0, Component v1.9)
MCA hwloc: hwloc152 (MCA v2.0, API v2.0, Component v1.9)
MCA if: linux_ipv6 (MCA v2.0, API v2.0, Component v1.9)
MCA if: posix_ipv4 (MCA v2.0, API v2.0, Component v1.9)
MCA installdirs: env (MCA v2.0, API v2.0, Component v1.9)
MCA installdirs: config (MCA v2.0, API v2.0, Component v1.9)
MCA memory: linux (MCA v2.0, API v2.0, Component v1.9)
MCA pstat: linux (MCA v2.0, API v2.0, Component v1.9)
MCA shmem: mmap (MCA v2.0, API v2.0, Component v1.9)
MCA shmem: posix (MCA v2.
Re: [OMPI users] One-sided bugs
I've resolved the problem in a satisfactory way by circumventing one-sided entirely. I.e., this issue is finally closed: https://bitbucket.org/petsc/petsc-dev/issue/9/implement-petscsf-without-one-sided

Users can proceed anyway using the run-time option -acknowledge_ompi_onesided_bug, which will also be a convenient way to test an eventual fix (beyond the reduced test cases that have been sitting in your bug tracker for several years). (This is only relevant with -sf_type window; the default no longer uses one-sided.)

I would still like to encourage Open MPI to deliver an error message in this known broken case instead of silently stomping all over the user's memory.

On Tue, Sep 11, 2012 at 2:23 PM, Jed Brown wrote:

> *Bump*
>
> There doesn't seem to have been any progress on this. Can you at least
> have an error message saying that Open MPI one-sided does not work with
> datatypes instead of silently causing wanton corruption and deadlock?
>
> On Thu, Dec 22, 2011 at 4:17 PM, Jed Brown wrote:
>
>> [Forgot the attachment.]
>>
>> On Thu, Dec 22, 2011 at 15:16, Jed Brown wrote:
>>
>>> I wrote a new communication layer that we are evaluating for use in mesh
>>> management and PDE solvers, but it is based on MPI-2 one-sided operations
>>> (and will eventually benefit from some of the MPI-3 one-sided proposals,
>>> especially MPI_Fetch_and_op() and dynamic windows). All the basic
>>> functionality works well with MPICH2, but I have run into some Open MPI
>>> bugs regarding one-sided operations with composite data types. This email
>>> provides a reduced test case for two such bugs. I see that there are also
>>> some existing serious-looking bug reports regarding one-sided operations,
>>> but they are getting pretty old now and haven't seen action in a while.
>>>
>>> https://svn.open-mpi.org/trac/ompi/ticket/2656
>>> https://svn.open-mpi.org/trac/ompi/ticket/1905
>>>
>>> Is there a plan for resolving these in the near future?
>>>
>>> Is anyone using Open MPI for serious work with one-sided operations?
>>>
>>> Bugs I am reporting:
>>>
>>> *1.* If an MPI_Win is used with an MPI_Datatype, even if the MPI_Win
>>> operation has completed, I get an invalid free when MPI_Type_free() is
>>> called before MPI_Win_free(). Since MPI_Type_free() is only supposed to
>>> mark the datatype for deletion, the implementation should properly manage
>>> reference counting. If you run the attached code with
>>>
>>> $ mpiexec -n 2 ./a.out 1
>>>
>>> (which only does part of the comm described for the second bug, below),
>>> you can see the invalid free on rank 1 with stack still in MPI_Win_fence()
>>>
>>> (gdb) bt
>>> #0 0x77288905 in raise () from /lib/libc.so.6
>>> #1 0x77289d7b in abort () from /lib/libc.so.6
>>> #2 0x772c147e in __libc_message () from /lib/libc.so.6
>>> #3 0x772c7396 in malloc_printerr () from /lib/libc.so.6
>>> #4 0x772cb26c in free () from /lib/libc.so.6
>>> #5 0x77a5aaa8 in ompi_datatype_release_args (pData=0x845010) at ompi_datatype_args.c:414
>>> #6 0x77a5b0ea in __ompi_datatype_release (datatype=0x845010) at ompi_datatype_create.c:47
>>> #7 0x7218e772 in opal_obj_run_destructors (object=0x845010) at ../../../../opal/class/opal_object.h:448
>>> #8 ompi_osc_rdma_replyreq_free (replyreq=0x680a80) at osc_rdma_replyreq.h:136
>>> #9 ompi_osc_rdma_replyreq_send_cb (btl=0x73680ce0, endpoint=, descriptor=0x837b00, status=) at osc_rdma_data_move.c:691
>>> #10 0x7347f38f in mca_btl_sm_component_progress () at btl_sm_component.c:645
>>> #11 0x77b1f80a in opal_progress () at runtime/opal_progress.c:207
>>> #12 0x721977c5 in opal_condition_wait (m=, c=0x842ee0) at ../../../../opal/threads/condition.h:99
>>> #13 ompi_osc_rdma_module_fence (assert=0, win=0x842270) at osc_rdma_sync.c:207
>>> #14 0x77a89db5 in PMPI_Win_fence (assert=0, win=0x842270) at pwin_fence.c:60
>>> #15 0x004010d8 in main (argc=2, argv=0x7fffd508) at win.c:60
>>>
>>> meanwhile, rank 0 has already freed the datatype and is waiting in
>>> MPI_Win_free().
>>> (gdb) bt
>>> #0 0x77312107 in sched_yield () from /lib/libc.so.6
>>> #1 0x77b1f82b in opal_progress () at runtime/opal_progress.c:220
>>> #2 0x000
Re: [OMPI users] Setting RPATH for Open MPI libraries
Jeff, we are averaging a half dozen support threads per week on PETSc lists/email caused by the lack of RPATH in Open MPI for non-standard install locations. Can you either make the necessary environment modification more visible for novice users or implement the RPATH option?

On Wed, Sep 12, 2012 at 1:52 PM, Jed Brown wrote:

> On Wed, Sep 12, 2012 at 10:20 AM, Jeff Squyres wrote:
>
>> We have a long-standing philosophy that OMPI should add the bare minimum
>> number of preprocessor/compiler/linker flags to its wrapper compilers, and
>> let the user/administrator customize from there.
>
> In general, I agree with that philosophy.
>
>> That being said, a looong time ago, I started a patch to add a
>> --with-rpath option to configure, but never finished it. :-\ I can try to
>> get it back on my to-do list.
>
> That would be perfect.
>
>> For the moment, you might want to try the configure
>> --enable-mpirun-prefix-by-default option, too.
>
> The downside is that we tend not to bother with the mpirun for configure
> and it's a little annoying to "mpirun ldd" when hunting for other problems
> (e.g. a missing shared lib unrelated to Open MPI).
Re: [OMPI users] Setting RPATH for Open MPI libraries
On Wed, Sep 12, 2012 at 10:20 AM, Jeff Squyres wrote:

> We have a long-standing philosophy that OMPI should add the bare minimum
> number of preprocessor/compiler/linker flags to its wrapper compilers, and
> let the user/administrator customize from there.

In general, I agree with that philosophy.

> That being said, a looong time ago, I started a patch to add a
> --with-rpath option to configure, but never finished it. :-\ I can try to
> get it back on my to-do list.

That would be perfect.

> For the moment, you might want to try the configure
> --enable-mpirun-prefix-by-default option, too.

The downside is that we tend not to bother with the mpirun for configure and it's a little annoying to "mpirun ldd" when hunting for other problems (e.g. a missing shared lib unrelated to Open MPI).
Re: [OMPI users] One-sided bugs
*Bump*

There doesn't seem to have been any progress on this. Can you at least have an error message saying that Open MPI one-sided does not work with datatypes instead of silently causing wanton corruption and deadlock?

On Thu, Dec 22, 2011 at 4:17 PM, Jed Brown wrote:

> [Forgot the attachment.]
>
> On Thu, Dec 22, 2011 at 15:16, Jed Brown wrote:
>
>> I wrote a new communication layer that we are evaluating for use in mesh
>> management and PDE solvers, but it is based on MPI-2 one-sided operations
>> (and will eventually benefit from some of the MPI-3 one-sided proposals,
>> especially MPI_Fetch_and_op() and dynamic windows). All the basic
>> functionality works well with MPICH2, but I have run into some Open MPI
>> bugs regarding one-sided operations with composite data types. This email
>> provides a reduced test case for two such bugs. I see that there are also
>> some existing serious-looking bug reports regarding one-sided operations,
>> but they are getting pretty old now and haven't seen action in a while.
>>
>> https://svn.open-mpi.org/trac/ompi/ticket/2656
>> https://svn.open-mpi.org/trac/ompi/ticket/1905
>>
>> Is there a plan for resolving these in the near future?
>>
>> Is anyone using Open MPI for serious work with one-sided operations?
>>
>> Bugs I am reporting:
>>
>> *1.* If an MPI_Win is used with an MPI_Datatype, even if the MPI_Win
>> operation has completed, I get an invalid free when MPI_Type_free() is
>> called before MPI_Win_free(). Since MPI_Type_free() is only supposed to
>> mark the datatype for deletion, the implementation should properly manage
>> reference counting. If you run the attached code with
>>
>> $ mpiexec -n 2 ./a.out 1
>>
>> (which only does part of the comm described for the second bug, below),
>> you can see the invalid free on rank 1 with stack still in MPI_Win_fence()
>>
>> (gdb) bt
>> #0 0x77288905 in raise () from /lib/libc.so.6
>> #1 0x77289d7b in abort () from /lib/libc.so.6
>> #2 0x772c147e in __libc_message () from /lib/libc.so.6
>> #3 0x772c7396 in malloc_printerr () from /lib/libc.so.6
>> #4 0x772cb26c in free () from /lib/libc.so.6
>> #5 0x77a5aaa8 in ompi_datatype_release_args (pData=0x845010) at ompi_datatype_args.c:414
>> #6 0x77a5b0ea in __ompi_datatype_release (datatype=0x845010) at ompi_datatype_create.c:47
>> #7 0x7218e772 in opal_obj_run_destructors (object=0x845010) at ../../../../opal/class/opal_object.h:448
>> #8 ompi_osc_rdma_replyreq_free (replyreq=0x680a80) at osc_rdma_replyreq.h:136
>> #9 ompi_osc_rdma_replyreq_send_cb (btl=0x73680ce0, endpoint=, descriptor=0x837b00, status=) at osc_rdma_data_move.c:691
>> #10 0x7347f38f in mca_btl_sm_component_progress () at btl_sm_component.c:645
>> #11 0x77b1f80a in opal_progress () at runtime/opal_progress.c:207
>> #12 0x721977c5 in opal_condition_wait (m=, c=0x842ee0) at ../../../../opal/threads/condition.h:99
>> #13 ompi_osc_rdma_module_fence (assert=0, win=0x842270) at osc_rdma_sync.c:207
>> #14 0x77a89db5 in PMPI_Win_fence (assert=0, win=0x842270) at pwin_fence.c:60
>> #15 0x004010d8 in main (argc=2, argv=0x7fffd508) at win.c:60
>>
>> meanwhile, rank 0 has already freed the datatype and is waiting in
>> MPI_Win_free().
>>
>> (gdb) bt
>> #0 0x77312107 in sched_yield () from /lib/libc.so.6
>> #1 0x77b1f82b in opal_progress () at runtime/opal_progress.c:220
>> #2 0x77a53fe4 in opal_condition_wait (m=, c=) at ../opal/threads/condition.h:99
>> #3 ompi_request_default_wait_all (count=2, requests=0x7fffd210, statuses=0x7fffd1e0) at request/req_wait.c:263
>> #4 0x725b8d71 in ompi_coll_tuned_sendrecv_actual (sendbuf=0x0, scount=0, sdatatype=0x77dba840, dest=1, stag=-16, recvbuf=, rcount=0, rdatatype=0x77dba840, source=1, rtag=-16, comm=0x8431a0, status=0x0) at coll_tuned_util.c:54
>> #5 0x725c2de2 in ompi_coll_tuned_barrier_intra_two_procs (comm=, module=) at coll_tuned_barrier.c:256
>> #6 0x725b92ab in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x8431a0, module=0x844980) at coll_tuned_decision_fixed.c:190
>> #7 0x72186248 in ompi_osc_rdma_module_free (win=0x842170) at osc_rdma.c:46
>> #8 0x77a58a44 in ompi_win_free (win=0x842170) at win/win.c:150
>> #9 0x77a8a0dd in PMPI_Win_free (win=0x7fffd408) at pwin_free.c:
Re: [OMPI users] Setting RPATH for Open MPI libraries
On Tue, Sep 11, 2012 at 2:29 PM, Reuti wrote:

> With "user" you mean someone compiling Open MPI?

Yes
Re: [OMPI users] Setting RPATH for Open MPI libraries
I want to avoid the user having to figure that out. MPICH2 sets RPATH by default when installed to nonstandard locations, and I think that is not a bad choice. Usually applications are compiled differently when they want to switch between debug and optimized builds (or other reasons for selecting a different library using LD_LIBRARY_PATH).

On Sep 8, 2012 2:48 PM, "Reuti" wrote:

> Hi,
>
> Am 08.09.2012 um 14:46 schrieb Jed Brown:
>
>> Is there a way to configure Open MPI to use RPATH without needing to
>> manually specify --with-wrapper-ldflags=-Wl,-rpath,${prefix}/lib (and
>> similar for non-GNU-compatible compilers)?
>
> What do you want to achieve in detail - just shorten the `./configure`
> command line? You could also add it after Open MPI's compilation in the
> text file:
>
> ${prefix}/share/openmpi/mpicc-wrapper-data.txt
>
> -- Reuti
[OMPI users] Setting RPATH for Open MPI libraries
Is there a way to configure Open MPI to use RPATH without needing to manually specify --with-wrapper-ldflags=-Wl,-rpath,${prefix}/lib (and similar for non-GNU-compatible compilers)?
Re: [OMPI users] [EXTERNAL] Re: mpicc link shouldn't add -ldl and -lhwloc
On Thu, May 31, 2012 at 6:20 AM, Jeff Squyres wrote:

> On May 29, 2012, at 11:42 AM, Jed Brown wrote:
>
>> The pkg-config approach is to use pkg-config --static if you want to
>> link that library statically.
>
> Do the OMPI pkg-config files not do this properly?

Looks right to me. I think the complaint was that there was no way to specify the equivalent using wrapper compilers. I don't like the wrapper compiler model (certainly not for languages with a common ABI like C), but pkg-config doesn't have a good way to manage multiple configurations.

>> So the problem is almost exclusively one of binary compatibility. If an
>> app or library is only linked to the interface libs, underlying system
>> libraries can be upgraded to a different soname without needing to relink
>> the applications. For example, libhwloc could be upgraded to a different
>> ABI, Open MPI rebuilt, and then the user application and intermediate
>> MPI-based libraries would not need to be rebuilt. This is great for
>> distributions and convenient if you work on projects with lots of
>> dependencies.
>>
>> It's not such an issue for HPC applications because we tend to recompile
>> a lot and don't need binary compatibility for many of the most common use
>> cases. There is also the linker option -Wl,--as-needed that usually does
>> what is desired.
>
> Mmmm. Ok. Brian and I are going to be in the same physical location next
> week; I'll chat with him about this.
Re: [OMPI users] [EXTERNAL] Re: mpicc link shouldn't add -ldl and -lhwloc
On Tue, May 29, 2012 at 9:05 AM, Jeff Squyres wrote:

> We've tossed around ideas such as having the wrappers always assume
> dynamic linking (e.g., only include a minimum of libraries), and then add
> another wrapper option like --wrapper:static (or whatever) to know when to
> add in all the dependent libraries. Or possibly even look for some popular
> linker options like --static, or some such (which we've tried to avoid,
> because that can turn into a slippery slope), but such switches aren't
> always necessary for MPI-only-static (vs. completely-100%-static) linking.
> It gets even fuzzier when both libmpi.so and libmpi.a are present. Which
> way should we assume?
>
> Another problem is backwards compatibility -- users who are currently
> statically linking will assume the old behavior (of not needing to specify
> anything additional).
>
>> Now I'm not saying that Open MPI should commit to pkg-config instead of
>> wrapper compilers, but the concept of linking differently for static versus
>> shared libraries is something that should be observed.
>
> Fair enough. But we've never been able to come up with a rational way to
> do it (note that pkg-config has its own problems -- OMPI provides
> pkg-config files in addition to wrapper compilers, but they don't fix
> everything, either).
>
> We have users who both --enable-static and --enable-shared (meaning: both
> libmpi.so and libmpi.a are present). And therefore we've come down on the
> conservative side of adding in whatever is necessary for static linking.

The pkg-config approach is to use pkg-config --static if you want to link that library statically.

>> (Over-linking is an ongoing problem with HPC-oriented packages. We are
>> probably all guilty of it, but tools like pkg-config don't handle multiple
>> configurations well and I don't know of a similar system that manages both
>> static/shared and multi-configuration well.)
>
> I suppose, but it does depend on how you define "problem". The linker
> will ignore any unused libraries -- so it's a problem like lint is a
> problem. It's annoying, but it doesn't do any harm.
>
> ...or are there cases where it actually does something harmful?

So the problem is almost exclusively one of binary compatibility. If an app or library is only linked to the interface libs, underlying system libraries can be upgraded to a different soname without needing to relink the applications. For example, libhwloc could be upgraded to a different ABI, Open MPI rebuilt, and then the user application and intermediate MPI-based libraries would not need to be rebuilt. This is great for distributions and convenient if you work on projects with lots of dependencies.

It's not such an issue for HPC applications because we tend to recompile a lot and don't need binary compatibility for many of the most common use cases. There is also the linker option -Wl,--as-needed that usually does what is desired.
Re: [OMPI users] [EXTERNAL] Re: mpicc link shouldn't add -ldl and -lhwloc
On Wed, May 23, 2012 at 8:29 AM, Barrett, Brian W wrote:

>> I should add the caveat that they are needed when linking statically, but
>> not when using shared libraries.
>
> And therein lies the problem. We have a number of users who build Open
> MPI statically and even some who build both static and shared libraries in
> the same build. We've never been able to figure out a reasonable way to
> guess if we need to add -lhwloc or -ldl, so we add them. It's better to
> list them and have some redundant dependencies (since you have to have the
> library anyways) than to not list them and have odd link errors.

So pkg-config has the --static option for exactly this reason. Let's look at Cairo as an example.

$ cat /usr/lib/pkgconfig/cairo.pc
prefix=/usr
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include

Name: cairo
Description: Multi-platform 2D graphics library
Version: 1.12.2
Requires.private: gobject-2.0 glib-2.0 pixman-1 >= 0.22.0 fontconfig >= 2.2.95 freetype2 >= 9.7.3 libpng xcb-shm xcb >= 1.6 xcb-render >= 1.6 xrender >= 0.6 x11
Libs: -L${libdir} -lcairo
Libs.private: -lz -lz
Cflags: -I${includedir}/cairo

$ pkg-config cairo --libs
-lcairo

$ pkg-config cairo --libs --static
-pthread -lcairo -lgobject-2.0 -lffi -lpixman-1 -lfontconfig -lexpat -lfreetype -lbz2 -lpng15 -lz -lm -lxcb-shm -lxcb-render -lXrender -lglib-2.0 -lrt -lpcre -lX11 -lpthread -lxcb -lXau -lXdmcp

$ ldd /usr/lib/libcairo.so
linux-vdso.so.1 => (0x7fff741ff000)
libpthread.so.0 => /lib/libpthread.so.0 (0x7f135eac7000)
libpixman-1.so.0 => /usr/lib/libpixman-1.so.0 (0x7f135e83f000)
libfontconfig.so.1 => /usr/lib/libfontconfig.so.1 (0x7f135e608000)
libfreetype.so.6 => /usr/lib/libfreetype.so.6 (0x7f135e369000)
libpng15.so.15 => /usr/lib/libpng15.so.15 (0x7f135e13c000)
libxcb-shm.so.0 => /usr/lib/libxcb-shm.so.0 (0x7f135df39000)
libxcb-render.so.0 => /usr/lib/libxcb-render.so.0 (0x7f135dd3)
libxcb.so.1 => /usr/lib/libxcb.so.1 (0x7f135db12000)
libXrender.so.1 => /usr/lib/libXrender.so.1 (0x7f135d906000)
libX11.so.6 => /usr/lib/libX11.so.6 (0x7f135d5cc000)
libz.so.1 => /usr/lib/libz.so.1 (0x7f135d3b6000)
librt.so.1 => /lib/librt.so.1 (0x7f135d1ad000)
libm.so.6 => /lib/libm.so.6 (0x7f135ceb8000)
libc.so.6 => /lib/libc.so.6 (0x7f135cb17000)
/lib/ld-linux-x86-64.so.2 (0x7f135f012000)
libbz2.so.1.0 => /usr/lib/libbz2.so.1.0 (0x7f135c906000)
libexpat.so.1 => /usr/lib/libexpat.so.1 (0x7f135c6dc000)
libXau.so.6 => /usr/lib/libXau.so.6 (0x7f135c4d8000)
libXdmcp.so.6 => /usr/lib/libXdmcp.so.6 (0x7f135c2d1000)
libdl.so.2 => /lib/libdl.so.2 (0x7f135c0cd000)

Now I'm not saying that Open MPI should commit to pkg-config instead of wrapper compilers, but the concept of linking differently for static versus shared libraries is something that should be observed.

(Over-linking is an ongoing problem with HPC-oriented packages. We are probably all guilty of it, but tools like pkg-config don't handle multiple configurations well and I don't know of a similar system that manages both static/shared and multi-configuration well.)
Re: [OMPI users] AlltoallV (with some zero send count values)
On Tue, Mar 6, 2012 at 15:43, Timothy Stitt wrote:

> Can anyone tell me whether it is legal to pass zero values for some of the
> send count elements in an MPI_AlltoallV() call? I want to perform an
> all-to-all operation but for performance reasons do not want to send data
> to various ranks who don't need to receive any useful values. If it is
> legal, can I assume the implementation is smart enough to not send messages
> when the send count is 0?
>
> FYI: my tests show that AlltoallV operations with various send count
> values set to 0...hangs.

This is allowed by the standard, but be warned that it is likely to perform poorly compared to what could be done with point-to-point or one-sided operations if most links are empty.
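For reference, a sparse exchange with zero send counts looks like the following sketch (the SparseExchange name and the displacement construction are illustrative; legal per the standard, but as noted above it may perform poorly if the implementation does not skip empty links):

#include <mpi.h>
#include <stdlib.h>

/* All-to-all of doubles where most sendcounts/recvcounts are zero.
 * Displacements are packed contiguously from the counts. */
void SparseExchange(MPI_Comm comm, const double *sendbuf, double *recvbuf,
                    const int *sendcounts, const int *recvcounts)
{
  int size, i;
  MPI_Comm_size(comm, &size);
  int *sdispls = malloc(size * sizeof(int));
  int *rdispls = malloc(size * sizeof(int));
  sdispls[0] = rdispls[0] = 0;
  for (i = 1; i < size; i++) {
    sdispls[i] = sdispls[i-1] + sendcounts[i-1];
    rdispls[i] = rdispls[i-1] + recvcounts[i-1];
  }
  MPI_Alltoallv((void *)sendbuf, (int *)sendcounts, sdispls, MPI_DOUBLE,
                recvbuf, (int *)recvcounts, rdispls, MPI_DOUBLE, comm);
  free(sdispls);
  free(rdispls);
}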
Re: [OMPI users] parallelising ADI
On Tue, Mar 6, 2012 at 16:23, Tim Prince wrote:

> On 03/06/2012 03:59 PM, Kharche, Sanjay wrote:
>
>> Hi
>>
>> I am working on a 3D ADI solver for the heat equation. I have implemented
>> it as serial. Would anybody be able to indicate the best and more
>> straightforward way to parallelise it. Apologies if this is going to the
>> wrong forum.
>
> If it's to be implemented in parallelizable fashion (not SSOR style
> where each line uses updates from the previous line), it should be feasible
> to divide the outer loop into an appropriate number of blocks, or decompose
> the physical domain and perform ADI on individual blocks, then update and
> repeat.

True ADI has inherently high communication cost: a lot of data movement is needed to make the _fundamentally sequential_ tridiagonal solves local enough that latency doesn't kill you when those solves are kept distributed. This also applies (albeit to a lesser degree) in serial, due to the way memory works.

If you only do non-overlapping subdomain solves, you must use a Krylov method just to ensure convergence. You can add overlap, but the Krylov method is still necessary for any practical convergence rate. The method will also require an iteration count proportional to the number of subdomains across the global domain times the square root of the number of elements across a subdomain. The constants may not be small, and this asymptotic result is independent of what the subdomain solver is. You need a coarse level to improve this scaling.

Sanjay, as Matt and I recommended when you asked the same question on the PETSc list this morning, unless this is a homework assignment, you should solve your problem with multigrid instead of ADI. We pointed you to simple example code that scales well to many thousands of processes.
Re: [OMPI users] One-sided bugs
[Forgot the attachment.]

On Thu, Dec 22, 2011 at 15:16, Jed Brown wrote:

> I wrote a new communication layer that we are evaluating for use in mesh
> management and PDE solvers, but it is based on MPI-2 one-sided operations
> (and will eventually benefit from some of the MPI-3 one-sided proposals,
> especially MPI_Fetch_and_op() and dynamic windows). All the basic
> functionality works well with MPICH2, but I have run into some Open MPI
> bugs regarding one-sided operations with composite data types. This email
> provides a reduced test case for two such bugs. I see that there are also
> some existing serious-looking bug reports regarding one-sided operations,
> but they are getting pretty old now and haven't seen action in a while.
>
> https://svn.open-mpi.org/trac/ompi/ticket/2656
> https://svn.open-mpi.org/trac/ompi/ticket/1905
>
> Is there a plan for resolving these in the near future?
>
> Is anyone using Open MPI for serious work with one-sided operations?
>
> Bugs I am reporting:
>
> *1.* If an MPI_Win is used with an MPI_Datatype, even if the MPI_Win
> operation has completed, I get an invalid free when MPI_Type_free() is
> called before MPI_Win_free(). Since MPI_Type_free() is only supposed to
> mark the datatype for deletion, the implementation should properly manage
> reference counting. If you run the attached code with
>
> $ mpiexec -n 2 ./a.out 1
>
> (which only does part of the comm described for the second bug, below),
> you can see the invalid free on rank 1 with stack still in MPI_Win_fence()
>
> (gdb) bt
> #0 0x77288905 in raise () from /lib/libc.so.6
> #1 0x77289d7b in abort () from /lib/libc.so.6
> #2 0x772c147e in __libc_message () from /lib/libc.so.6
> #3 0x772c7396 in malloc_printerr () from /lib/libc.so.6
> #4 0x772cb26c in free () from /lib/libc.so.6
> #5 0x77a5aaa8 in ompi_datatype_release_args (pData=0x845010) at ompi_datatype_args.c:414
> #6 0x77a5b0ea in __ompi_datatype_release (datatype=0x845010) at ompi_datatype_create.c:47
> #7 0x7218e772 in opal_obj_run_destructors (object=0x845010) at ../../../../opal/class/opal_object.h:448
> #8 ompi_osc_rdma_replyreq_free (replyreq=0x680a80) at osc_rdma_replyreq.h:136
> #9 ompi_osc_rdma_replyreq_send_cb (btl=0x73680ce0, endpoint=, descriptor=0x837b00, status=) at osc_rdma_data_move.c:691
> #10 0x7347f38f in mca_btl_sm_component_progress () at btl_sm_component.c:645
> #11 0x77b1f80a in opal_progress () at runtime/opal_progress.c:207
> #12 0x721977c5 in opal_condition_wait (m=, c=0x842ee0) at ../../../../opal/threads/condition.h:99
> #13 ompi_osc_rdma_module_fence (assert=0, win=0x842270) at osc_rdma_sync.c:207
> #14 0x77a89db5 in PMPI_Win_fence (assert=0, win=0x842270) at pwin_fence.c:60
> #15 0x004010d8 in main (argc=2, argv=0x7fffd508) at win.c:60
>
> meanwhile, rank 0 has already freed the datatype and is waiting in
> MPI_Win_free().
>
> (gdb) bt
> #0 0x77312107 in sched_yield () from /lib/libc.so.6
> #1 0x77b1f82b in opal_progress () at runtime/opal_progress.c:220
> #2 0x77a53fe4 in opal_condition_wait (m=, c=) at ../opal/threads/condition.h:99
> #3 ompi_request_default_wait_all (count=2, requests=0x7fffd210, statuses=0x7fffd1e0) at request/req_wait.c:263
> #4 0x725b8d71 in ompi_coll_tuned_sendrecv_actual (sendbuf=0x0, scount=0, sdatatype=0x77dba840, dest=1, stag=-16, recvbuf=, rcount=0, rdatatype=0x77dba840, source=1, rtag=-16, comm=0x8431a0, status=0x0) at coll_tuned_util.c:54
> #5 0x725c2de2 in ompi_coll_tuned_barrier_intra_two_procs (comm=, module=) at coll_tuned_barrier.c:256
> #6 0x725b92ab in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x8431a0, module=0x844980) at coll_tuned_decision_fixed.c:190
> #7 0x72186248 in ompi_osc_rdma_module_free (win=0x842170) at osc_rdma.c:46
> #8 0x77a58a44 in ompi_win_free (win=0x842170) at win/win.c:150
> #9 0x77a8a0dd in PMPI_Win_free (win=0x7fffd408) at pwin_free.c:56
> #10 0x00401195 in main (argc=2, argv=0x7fffd508) at win.c:69
>
> *2.* This appears to be more fundamental and perhaps much harder to fix.
> The attached code sets up the following graph
>
> rank 0:
> 0 -> (1,0)
> 1 -> nothing
> 2 -> (1,1)
>
> rank 1:
> 0 -> (0,0)
> 1 -> (0,2)
> 2 -> (0,1)
>
> We pull over this graph using two calls to MPI_Get(), each with composite
> data types defining what to pull into the first two slots, and what to put
> into the third slot. It is Valgrind-clean with MPICH2, and produces the fol
[OMPI users] One-sided bugs
I wrote a new communication layer that we are evaluating for use in mesh management and PDE solvers, but it is based on MPI-2 one-sided operations (and will eventually benefit from some of the MPI-3 one-sided proposals, especially MPI_Fetch_and_op() and dynamic windows). All the basic functionality works well with MPICH2, but I have run into some Open MPI bugs regarding one-sided operations with composite data types. This email provides a reduced test case for two such bugs. I see that there are also some existing serious-looking bug reports regarding one-sided operations, but they are getting pretty old now and haven't seen action in a while.

https://svn.open-mpi.org/trac/ompi/ticket/2656
https://svn.open-mpi.org/trac/ompi/ticket/1905

Is there a plan for resolving these in the near future?

Is anyone using Open MPI for serious work with one-sided operations?

Bugs I am reporting:

*1.* If an MPI_Win is used with an MPI_Datatype, even if the MPI_Win operation has completed, I get an invalid free when MPI_Type_free() is called before MPI_Win_free(). Since MPI_Type_free() is only supposed to mark the datatype for deletion, the implementation should properly manage reference counting. If you run the attached code with

$ mpiexec -n 2 ./a.out 1

(which only does part of the comm described for the second bug, below), you can see the invalid free on rank 1 with stack still in MPI_Win_fence()

(gdb) bt
#0 0x77288905 in raise () from /lib/libc.so.6
#1 0x77289d7b in abort () from /lib/libc.so.6
#2 0x772c147e in __libc_message () from /lib/libc.so.6
#3 0x772c7396 in malloc_printerr () from /lib/libc.so.6
#4 0x772cb26c in free () from /lib/libc.so.6
#5 0x77a5aaa8 in ompi_datatype_release_args (pData=0x845010) at ompi_datatype_args.c:414
#6 0x77a5b0ea in __ompi_datatype_release (datatype=0x845010) at ompi_datatype_create.c:47
#7 0x7218e772 in opal_obj_run_destructors (object=0x845010) at ../../../../opal/class/opal_object.h:448
#8 ompi_osc_rdma_replyreq_free (replyreq=0x680a80) at osc_rdma_replyreq.h:136
#9 ompi_osc_rdma_replyreq_send_cb (btl=0x73680ce0, endpoint=<value optimized out>, descriptor=0x837b00, status=<value optimized out>) at osc_rdma_data_move.c:691
#10 0x7347f38f in mca_btl_sm_component_progress () at btl_sm_component.c:645
#11 0x77b1f80a in opal_progress () at runtime/opal_progress.c:207
#12 0x721977c5 in opal_condition_wait (m=<value optimized out>, c=0x842ee0) at ../../../../opal/threads/condition.h:99
#13 ompi_osc_rdma_module_fence (assert=0, win=0x842270) at osc_rdma_sync.c:207
#14 0x77a89db5 in PMPI_Win_fence (assert=0, win=0x842270) at pwin_fence.c:60
#15 0x004010d8 in main (argc=2, argv=0x7fffd508) at win.c:60

meanwhile, rank 0 has already freed the datatype and is waiting in MPI_Win_free().
(gdb) bt
#0 0x77312107 in sched_yield () from /lib/libc.so.6
#1 0x77b1f82b in opal_progress () at runtime/opal_progress.c:220
#2 0x77a53fe4 in opal_condition_wait (m=<value optimized out>, c=<value optimized out>) at ../opal/threads/condition.h:99
#3 ompi_request_default_wait_all (count=2, requests=0x7fffd210, statuses=0x7fffd1e0) at request/req_wait.c:263
#4 0x725b8d71 in ompi_coll_tuned_sendrecv_actual (sendbuf=0x0, scount=0, sdatatype=0x77dba840, dest=1, stag=-16, recvbuf=<value optimized out>, rcount=0, rdatatype=0x77dba840, source=1, rtag=-16, comm=0x8431a0, status=0x0) at coll_tuned_util.c:54
#5 0x725c2de2 in ompi_coll_tuned_barrier_intra_two_procs (comm=<value optimized out>, module=<value optimized out>) at coll_tuned_barrier.c:256
#6 0x725b92ab in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x8431a0, module=0x844980) at coll_tuned_decision_fixed.c:190
#7 0x72186248 in ompi_osc_rdma_module_free (win=0x842170) at osc_rdma.c:46
#8 0x77a58a44 in ompi_win_free (win=0x842170) at win/win.c:150
#9 0x77a8a0dd in PMPI_Win_free (win=0x7fffd408) at pwin_free.c:56
#10 0x00401195 in main (argc=2, argv=0x7fffd508) at win.c:69

*2.* This appears to be more fundamental and perhaps much harder to fix. The attached code sets up the following graph

rank 0:
0 -> (1,0)
1 -> nothing
2 -> (1,1)

rank 1:
0 -> (0,0)
1 -> (0,2)
2 -> (0,1)

We pull over this graph using two calls to MPI_Get(), each with composite data types defining what to pull into the first two slots, and what to put into the third slot. It is Valgrind-clean with MPICH2, and produces the following:

$ mpiexec.hydra -n 2 ./a.out 2
[0] provided [100,101,102] got [200, -2,201]
[1] provided [200,201,202] got [100,102,101]

With Open MPI, I see

a.out: malloc.c:3096: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.

on both ranks, wi
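[The attachment is not preserved in this archive. The following is a minimal sketch of the bug-1 pattern described above (a composite datatype used by a one-sided operation, with MPI_Type_free() called before MPI_Win_free()); the buffer sizes and the indexed type are illustrative, not the original test case.]

#include <mpi.h>

int main(int argc, char **argv)
{
  int rank, blens[2] = {1, 1}, displs[2] = {0, 2};
  double local[3] = {0}, remote[3] = {0};
  MPI_Datatype type;
  MPI_Win win;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Win_create(local, 3*sizeof(double), sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  /* composite (noncontiguous) target datatype: entries 0 and 2 */
  MPI_Type_indexed(2, blens, displs, MPI_DOUBLE, &type);
  MPI_Type_commit(&type);

  MPI_Win_fence(0, win);
  MPI_Get(remote, 2, MPI_DOUBLE, (rank + 1) % 2, 0, 1, type, win);
  MPI_Type_free(&type);  /* legal: only marks the type for deletion */
  MPI_Win_fence(0, win); /* the reported invalid free shows up around here */
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}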
Re: [OMPI users] SpMV Benchmarks
On Thu, May 5, 2011 at 23:15, Paul Monday (Parallel Scientific) <paul.mon...@parsci.com> wrote:
> Hi, I'm hoping someone can help me locate a SpMV benchmark that runs w/
> Open MPI so I can benchmark how my systems are interacting with the network
> as I add nodes / cores to the pool of systems. I can find SpMV benchmarks
> for single processor / OpenMP all over, but these networked ones are proving
> harder to come by. I located Lis (http://www.ssisc.org/lis/) but it seems
> more of a solver than a benchmarking program.

I would suggest using PETSc. It is a solver library rather than a contrived benchmark suite, but the examples give you access to many different matrices and you can use many different formats without changing the code. If you run with -log_summary, you will get a useful table showing the performance of different operations (time/balance/communication/reductions/flops/etc). Also note that SpMV is usually not an end in its own right; usually it is part of a preconditioned Krylov iteration, so the performance of all the pieces matters.

If you are concerned with absolute performance, then you should consider using petsc-dev since it tends to have better memory performance due to software prefetch. This is important for good reuse of high-level caches since otherwise the matrix entries flush out the useful stuff. It usually makes a 20-30% improvement, a bit more for some symmetric and triangular kernels. Many of the sparse matrix kernels did not have software prefetch as of the 3.1 release.

Remember: "The easiest way to make software scalable is to make it sequentially inefficient." (Gropp, 1999)
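[For concreteness, a hypothetical session with one of the PETSc KSP tutorial examples; the example, mesh sizes, and option names may differ between PETSc versions:]

$ cd src/ksp/ksp/examples/tutorials && make ex2
$ mpiexec -n 8 ./ex2 -m 500 -n 500 -ksp_type cg -log_summary

The MatMult row of the -log_summary table is the SpMV kernel; the KSPSolve row shows how it composes with the rest of the preconditioned iteration, which is usually what you actually care about.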
Re: [OMPI users] memcpy overlap in ompi_ddt_copy_content_same_ddt and glibc 2.12
On Thu, Nov 11, 2010 at 12:36, Number Cruncher wrote:
> However as commented here:
> https://bugzilla.redhat.com/show_bug.cgi?id=638477#c86 the valgrind memcpy
> implementation is overlap-safe.

Yes, of course. That's how the bug in Open MPI was originally detected. But you can't do production runs with valgrind.

> Are you using an Intel Nehalem-class CPU?

No, Core 2 Duo and Opteron (but the Opterons still have older glibc). Reverse memcpy must only be turned on for Nehalem.

Jed
Re: [OMPI users] memcpy overlap in ompi_ddt_copy_content_same_ddt and glibc 2.12
On Wed, Nov 10, 2010 at 22:08, e-mail number.cruncher <number.crunc...@ntlworld.com> wrote:
> In short, someone from Intel submitted a glibc patch that does faster
> memcpy's on e.g. Intel i7, respects the ISO C definition, but does
> things backwards.

However, the commit message and mailing list, as far as I can tell, do not explain how the implementations were benchmarked. Linus claims that his (entirely trivial) implementation matches or beats the new one. If indeed the performance gains claimed by Lu (2X to 4X) are real, then the old implementation must have been truly horrible (as stated by Agner Fog in http://sourceware.org/ml/libc-help/2008-08/msg7.html). I'd like to see the benchmark results demonstrating that the backward memcpy is really faster than forward.

> I think any software that ignores the ISO warning
> "If copying takes place between objects that overlap, the behavior is
> undefined" needs fixing.

Absolutely, it is incorrect and should be fixed.

Jed
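[For reference, a minimal illustration of the ISO C rule being discussed: overlapping copies must use memmove(), which is specified for overlap, rather than memcpy(), which is not.]

#include <stdio.h>
#include <string.h>

int main(void)
{
  char buf[16] = "abcdefgh";
  /* Overlapping regions: shift "abcdefgh" right by two chars.
     memcpy(buf+2, buf, 8) would be undefined behavior here; a
     backward-copying memcpy would silently corrupt the data. */
  memmove(buf + 2, buf, 8);
  buf[10] = '\0';
  printf("%s\n", buf); /* prints "ababcdefgh" */
  return 0;
}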
Re: [OMPI users] memcpy overlap in ompi_ddt_copy_content_same_ddt and glibc 2.12
On Wed, Nov 10, 2010 at 18:25, Jed Brown wrote: > Is the memcpy-back code ever executed when called as memcpy()? I can't > imagine why it would be, but it would make plenty of sense to use it inside > memmove when the destination is at a higher address than the source. Apparently the backward memcpy is actually used, but I still don't know why and neither does Linus: https://bugzilla.redhat.com/show_bug.cgi?id=638477#c46 Jed
Re: [OMPI users] memcpy overlap in ompi_ddt_copy_content_same_ddt and glibc 2.12
On Wed, Nov 10, 2010 at 18:11, Number Cruncher wrote: > Just some observations from a concerned user with a temperamental Open MPI > program (1.4.3): > > Fedora 14 (just released) includes glibc-2.12 which has optimized versions > of memcpy, including a copy backward. > > http://sourceware.org/git/?p=glibc.git;a=commitdiff;h=6fb8cbcb58a29fff73eb2101b34caa19a7f88eba Is the memcpy-back code ever executed when called as memcpy()? I can't imagine why it would be, but it would make plenty of sense to use it inside memmove when the destination is at a higher address than the source. Jed
Re: [OMPI users] Open MPI data transfer error
On Sat, Nov 6, 2010 at 18:00, Jack Bryan wrote:
> Thanks,
>
> About my MPI program bugs:
>
> I used GDB and got the error:
>
> Program received signal SIGSEGV, Segmentation fault.
> #0 0x003a31c62184 in fwrite () from /lib64/libc.so.6

Clearly fwrite was called with invalid parameters, but you don't give enough information for anyone to explain why. Compile your program with debugging symbols and print the whole stack trace, e.g. with "backtrace full". Also try valgrind.

> class CNSGA2
> {
> allocate mem for var;
> some deallocate statement;
> some pointers;
> evaluate(); // it is a function
> }

This isn't even close to valid code since you can't have statements in the suggested scope.

> main()
> {
> CNSGA2* nsga2a = new CNSGA2(true); // true or false are only for different constructors
> CNSGA2* nsga2b = new CNSGA2(false);
> if (myRank == 0) // scope1
> {
> initialize the objects of nsga2a or nsga2b;
> }
> broadcast some parameters, which are got from scope1.
>
> According to the parameters, define a datatype (myData) so that all workers
> use that to do recv and send.
>
> if (myRank == 0) // scope2
> {
> send out myData to workers by the datatype defined above;
> }
> if (myRank != 0)
> {
> newCNSGA2 myNsga2;
> recv data from master and work on the recved data;
> myNsga2.evaluate(recv data);
> send back results;
> }
>
> }

According to the above, rank 0 never receives the results the workers send back. You should paste valid code.

Jed
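[For example, with hypothetical program and file names; --track-origins is a standard valgrind flag:]

$ mpicc -g -O0 nsga2.c -o nsga2
$ mpiexec -n 2 valgrind --track-origins=yes ./nsga2

and, after a crash with core dumps enabled (ulimit -c unlimited):

$ gdb ./nsga2 core
(gdb) bt full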
[OMPI users] Open MPI 1.5 is not detecting oversubscription
Previous versions would set mpi_yield_when_idle automatically when oversubscribing a node. I assume this behavior was not intentionally changed, but the parameter is not being set in cases of oversubscription, with or without an explicit hostfile. Jed
Re: [OMPI users] Open MPI program cannot complete
On Mon, Oct 25, 2010 at 19:35, Jack Bryan wrote: > I have to use #PBS to submit any jobs in my cluster. > I cannot use command line to hang a job on my cluster. > You don't need a cluster to run MPI jobs, can you run the job on whatever you development machine is? Does it hang there? PBS interactive jobs are started with qsub -I. > > Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid > ZOMBIE_PID) in the script ? > On the line after "mpirun ...", assuming that control returns to there after the hang. You didn't answer whether that was the case. > And how to get ZOMBIE_PID from the script ? > Simplest is "pgrep $COMMAND", or use ps. Jed
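[A sketch of what that could look like in the PBS script; the application name and rank count are hypothetical:]

mpirun -np 16 ./my_app
# only reached once mpirun returns, e.g. after rank 0 exits while other ranks hang
for pid in $(pgrep my_app); do
  gdb --batch -ex 'bt full' -ex 'info reg' -pid $pid
done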
Re: [OMPI users] Open MPI program cannot complete
On Mon, Oct 25, 2010 at 19:07, Jack Bryan wrote: > I need to use #PBS parallel job script to submit a job on MPI cluster. > Is it not possible to reproduce locally? Most clusters have a way to submit an interactive job (which would let you start this thing and then inspect individual processes). Ashley's Padb suggestion will certainly be better in a non-interactive environment. > Where should I put the (gdb --batch -ex 'bt full' -ex 'info reg' -pid > ZOMBIE_PID) in the script ? > Is control returning to your script after rank 0 has exited? In that case, you can just put this on the next line. > How to get the ZOMBIE_PID ? > "ps" from the command line, or getpid() from C code. Jed
Re: [OMPI users] Open MPI program cannot complete
On Mon, Oct 25, 2010 at 18:26, Jack Bryan wrote: > Thanks, the problem is still there. This really doesn't prove that there are no outstanding asynchronous requests, but perhaps you know that there are not, despite not being able to post a complete test case here. I suggest attaching a debugger and getting a stack trace from the zombies (gdb --batch -ex 'bt full' -ex 'info reg' -pid ZOMBIE_PID). Jed
Re: [OMPI users] Build failure with OMPI-1.5 (clang-2.8, gcc-4.5.1 with debug options)
On Fri, Oct 15, 2010 at 01:26, Jed Brown wrote:
> I'll report the bug

http://llvm.org/bugs/show_bug.cgi?id=8383
Re: [OMPI users] Build failure with OMPI-1.5 (clang-2.8, gcc-4.5.1 with debug options)
On Fri, Oct 15, 2010 at 00:38, Jeff Squyres (jsquyres) wrote:
> Huh. Can you make V=1 to build libmpi and use the same kind of options to
> build your sample library?

Make log here: http://59A2.org/files/openmpi-1.5-clang-make.log

After some digging, this looks like a clang bug. First, from the comments on http://llvm.org/bugs/show_bug.cgi?id=3679 there seems to be some resistance to the "#pragma weak g2 = g3" form, but since these things work with clang-2.8, that isn't the whole story. Indeed,

#pragma GCC visibility push(default)
#pragma weak fake = real
#pragma GCC visibility pop

does not expose the symbol "fake". This must be a bug, but an arguably better way to set up the aliasing is

int fake(int i) __attribute__((weak, alias("real")));

which does work. I'll report the bug, but maybe Open MPI could use a more complete test for visibility working with weak aliasing? Or just don't support clang-2.8.

Jed
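[A self-contained sketch of the attribute-based aliasing suggested above; the symbol names come from the configure test, not from Open MPI itself:]

/* weak_alias.c */
#include <stdio.h>

int real(int i) { return i; }
/* weak alias: references to fake() resolve to real() unless overridden */
int fake(int i) __attribute__((weak, alias("real")));

int main(void)
{
  printf("%d\n", fake(3));
  return 0;
}

Compiling with gcc or clang and running nm on the result should show fake as a weak (W) definition at the same address as real.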
Re: [OMPI users] Build failure with OMPI-1.5 (clang-2.8, gcc-4.5.1 with debug options)
On Thu, Oct 14, 2010 at 23:53, Jeff Squyres wrote:
> The configure test essentially looks like this -- could you try this
> manually and see what happens?
>
> cat > conftest_weak.h <<EOF
> int real(int i);
> int fake(int i);
> EOF
>
> cat > conftest_weak.c <<EOF
> #include "conftest_weak.h"
> #pragma weak fake = real
> int real(int i) { return i; }
> EOF
>
> cat > conftest.c <<EOF
> #include "conftest_weak.h"
> int main() { return fake(3); }
> EOF
>
> # Try the compile
> clang $CFLAGS -c conftest_weak.c
> clang $CFLAGS conftest.c conftest_weak.o -o conftest $LDFLAGS $LIBS
>
> The configure test rules that weak symbol support is there if both compiler
> invocations return an exit status of 0.

They exit 0 and

$ nm conftest | grep -E 'real|fake'
004004a0 W fake
004004a0 T real

so it looks like that is working fine. It also works fine when I stuff it into a shared library:

$ clang -c -fPIC conftest_weak.c
$ clang -shared -fPIC conftest.c conftest_weak.o -o conftest.so
$ nm conftest.so | grep -E 'real|fake'
05a0 W fake
05a0 T real

Jed
Re: [OMPI users] Build failure with OMPI-1.5 (clang-2.8, gcc-4.5.1 with debug options)
On Thu, Oct 14, 2010 at 23:31, Jeff Squyres wrote:
> Strange, because I see
> /home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/../../../.libs/libmpi.so
> explicitly listed in the link line, which should contain MPI_Abort. Can you
> nm on that file and ensure that it is actually listed there?

$ nm -D /home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/../../../.libs/libmpi.so | grep MPI_Abort
00074380 T PMPI_Abort

In contrast, with gcc:

$ nm -D /home/jed/src/openmpi-1.5/bgcc/ompi/contrib/vt/vt/../../../.libs/libmpi.so | grep MPI_Abort
000712d0 W MPI_Abort
000712d0 T PMPI_Abort

Weak symbol issue; I don't know how clang is different in this regard.

Jed
Re: [OMPI users] Build failure with OMPI-1.5 (clang-2.8, gcc-4.5.1 with debug options)
On Thu, Oct 14, 2010 at 22:36, Jeff Squyres wrote: > On Oct 11, 2010, at 4:50 PM, Jed Brown wrote: > > > Note that this is an out-of-source build. > > > > $ ../configure --enable-debug --enable-mem-debug > --prefix=/home/jed/usr/ompi-1.5-clang CC=clang CXX=clang++ > > $ make > > [...] > > CXXLD vtunify-mpi > > vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Abort': > > > /home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:63: > undefined reference to `MPI_Abort' > > Well this is disappointing. :-\ > > Can you "make V=1" so that we can see the command line here that is > failing? > libtool: link: clang++ -DVT_MPI -g -finline-functions -pthread -o .libs/vtunify-mpi vtunify_mpi-vt_unify_mpi.o vtunify_mpi-vt_unify.o vtunify_mpi-vt_unify_defs.o vtunify_mpi-vt_unify_defs_hdlr.o vtunify_mpi-vt_unify_events.o vtunify_mpi-vt_unify_events_hdlr.o vtunify_mpi-vt_unify_markers.o vtunify_mpi-vt_unify_markers_hdlr.o vtunify_mpi-vt_unify_stats.o vtunify_mpi-vt_unify_stats_hdlr.o vtunify_mpi-vt_unify_tkfac.o ../../../util/.libs/libutil.a ../../../extlib/otf/otflib/.libs/libotf.so -lz -L/home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/../../../.libs /home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/../../../.libs/libmpi.so -ldl -lnsl -lutil -lm -pthread -Wl,-rpath -Wl,/home/jed/usr/ompi-1.5-clang/lib vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Abort': /home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:63: undefined reference to `MPI_Abort' > FWIW, this looks like a problem that is self-contained in VampirTrace, so > you can likely get a working build with: > > ./configure --enable-contrib-no-build=vt ... > > > Leaving out the debugging flags gets me further (no compilation error, > just this link error): > > > > $ ../configure --prefix=/home/jed/usr/ompi-1.5-clang CC=clang CXX=clang++ > > $ make > > [...] > > CCLD libutil.la > > ar: > /home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/util/.libs/libutil.a: No > such file or directory > > make[5]: *** [libutil.la] Error 9 > > That's a weird one -- it should be *creating* that library, so I'm not sure > why it would complain that the library doesn't exist...? This could be a > red herring, though -- perhaps some oddity in your tree and/or > filesystem...? 
(I've seen this kind of thing before such that a "make > distclean" fixed the issue, I think) > Sure enough, using a new build directory, I get the same error as above: libtool: link: clang++ -DVT_MPI -O3 -DNDEBUG -finline-functions -pthread -o .libs/vtunify-mpi vtunify_mpi-vt_unify_mpi.o vtunify_mpi-vt_unify.o vtunif y_mpi-vt_unify_defs.o vtunify_mpi-vt_unify_defs_hdlr.o vtunify_mpi-vt_unify_events.o vtunify_mpi-vt_unify_events_hdlr.o vtunify_mpi-vt_unify_markers.o vtunify_mpi-vt_unify_markers_hdlr.o vtunify_mpi-vt_unify_stats.o vtunify_mpi-vt_unify_stats_hdlr.o vtunify_mpi-vt_unify_tkfac.o ../../../util/.libs/ libutil.a ../../../extlib/otf/otflib/.libs/libotf.so -lz -L/home/jed/src/openmpi-1.5/bclang-nodbg/ompi/contrib/vt/vt/../../../.libs /home/jed/src/open mpi-1.5/bclang-nodbg/ompi/contrib/vt/vt/../../../.libs/libmpi.so -ldl -lnsl -lutil -lm -pthread -Wl,-rpath -Wl,/home/jed/usr/ompi-1.5-clang-nodbg/lib vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Abort': ../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:(.text+0xa): undefined reference to `MPI_Abort' Grab config.log for this case here: http://59A2.org/files/openmpi-1.5-clang-config.log > > I also get this last failure with gcc-4.5.1, but only with the debug > flags: > > > > $ ../configure --enable-debug --enable-mem-debug > --prefix=/home/jed/usr/ompi-1.5-gcc CC=gcc CXX=g++ > > $ make > > [...] > > Making all in util > > CC libutil_la-installdirs.lo > > CCLD libutil.la > > ar: > /home/jed/src/openmpi-1.5/bgcc/ompi/contrib/vt/vt/util/.libs/libutil.a: No > such file or directory > > Same error. Weird. Can you "make V=1" here, too? This one completes with a clean build directory, reconfiguring from a non-debug build must have caused this issue the first time around. Jed
[OMPI users] Build failure with OMPI-1.5 (clang-2.8, gcc-4.5.1 with debug options)
Note that this is an out-of-source build.

$ ../configure --enable-debug --enable-mem-debug --prefix=/home/jed/usr/ompi-1.5-clang CC=clang CXX=clang++
$ make
[...]
CXXLD vtunify-mpi
vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Abort':
/home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:63: undefined reference to `MPI_Abort'
vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Address':
/home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:74: undefined reference to `MPI_Address'
vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Barrier':
/home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:86: undefined reference to `MPI_Barrier'
vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Bcast':
/home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:101: undefined reference to `MPI_Bcast'
vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Comm_size':
/home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:115: undefined reference to `MPI_Comm_size'
vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Comm_rank':
/home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:127: undefined reference to `MPI_Comm_rank'
vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Finalize':
/home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:138: undefined reference to `MPI_Finalize'
vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Init':
/home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:149: undefined reference to `MPI_Init'
vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Pack':
/home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:165: undefined reference to `MPI_Pack'
vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Pack_size':
/home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:180: undefined reference to `MPI_Pack_size'
vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Recv':
/home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:197: undefined reference to `MPI_Recv'
vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Send':
/home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:218: undefined reference to `MPI_Send'
vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Type_commit':
/home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:230: undefined reference to `MPI_Type_commit'
vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Type_free':
/home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:242: undefined reference to `MPI_Type_free'
vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Type_struct':
/home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:270: undefined reference to `MPI_Type_struct'
vtunify_mpi-vt_unify_mpi.o: In function `VTUnify_MPI_Unpack':
/home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/tools/vtunify/mpi/../../../../../../../../ompi/contrib/vt/vt/tools/vtunify/mpi/vt_unify_mpi.c:300: undefined reference to `MPI_Unpack'
collect2: ld returned 1 exit status
clang: error: linker (via gcc) command failed with exit code 1 (use -v to see invocation)
make[7]: *** [vtunify-mpi] Error 1

Leaving out the debugging flags gets me further (no compilation error, just this link error):

$ ../configure --prefix=/home/jed/usr/ompi-1.5-clang CC=clang CXX=clang++
$ make
[...]
CCLD libutil.la
ar: /home/jed/src/openmpi-1.5/bclang/ompi/contrib/vt/vt/util/.libs/libutil.a: No such file or directory
make[5]: *** [libutil.la] Error 9

I also get this last failure with gcc-4.5.1, but only with the debug flags:

$ ../configure --enable-debug --enable-mem-debug --prefix=/home/jed/usr/ompi-1.5-gcc CC=gcc CXX=g++
$ make
[...]
Making all in util
CC libutil_la-installdirs.lo
CCLD libutil.la
ar: /home/jed/src/openmpi-1.5/bgc
Re: [OMPI users] OpenMPI Run-Time "Freedom" Question
Or OMPI_CC=icc-xx.y mpicc ... Jed On Aug 12, 2010 5:18 PM, "Ralph Castain" wrote: On Aug 12, 2010, at 7:04 PM, Michael E. Thomadakis wrote: > On 08/12/10 18:59, Tim Prince wrote: >>... The "easy" way to accomplish this would be to: (a) build OMPI with whatever compiler you decide to use as a "baseline" (b) do -not- use the wrapper compiler to build the application. Instead, do "mpicc --showme" (or whatever language equivalent you want) to get the compile line, substitute your "new" compiler library for the "old" one, and then execute the resulting command manually. If you then set your LD_LIBRARY_PATH to the "new" libs, it might work - but no guarantees. Still, you could try it - and if it worked, you could always just explain that this is a case-by-case situation, and so it -could- break with other compiler combinations. Critical note: the app developers would have to validate the code with every combination! Otherwise, correct execution will be a complete crap-shoot - just because the app doesn't abnormally terminate does -not- mean it generated a correct result! > Thanks for the information on this. We indeed use Intel Compiler set 11.1.XXX + OMPI 1.4.1 and ... ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Do MPI calls ever sleep?
On Wed, 21 Jul 2010 15:20:24 -0400, David Ronis wrote: > Hi Jed, > > Thanks for the reply and suggestion. I tried adding -mca > yield_when_idle 1 (and later mpi_yield_when_idle 1 which is what > ompi_info reports the variable as) but it seems to have had 0 effect. > My master goes into fftw planning routines for a minute or so (I see the > threads being created), but the overall usage of the slaves remains > close to 100% during this time. Just to be sure, I put the slaves into > a MPI_Barrier(MPI_COMM_WORLD) while they were waiting for the fftw > planner to finish. It also didn't help. They still spin (instead of using e.g. select()), but call sched_yield() so should only be actively spinning when nothing else is trying to run. Are you sure that the planner is always running in parallel? What OS and OMPI version are you using? Jed
Re: [OMPI users] Do MPI calls ever sleep?
On Wed, 21 Jul 2010 14:10:53 -0400, David Ronis wrote: > Is there another MPI routine that polls for data and then gives up its > time-slice? You're probably looking for the runtime option -mca yield_when_idle 1. This will slightly increase latency, but allows other threads to run without competing with the spinning MPI. Jed
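[For example, with a hypothetical application name; this is the same MCA parameter that ompi_info reports as mpi_yield_when_idle:]

$ mpirun -np 8 -mca mpi_yield_when_idle 1 ./my_app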
Re: [OMPI users] openmpi v1.5?
On Mon, 19 Jul 2010 15:24:32 -0400, Jeff Squyres wrote:
> I'm actually waiting for *1* more bug fix before we consider 1.5 "complete".

I see this going through, but would it be possible to change the size of the _count field in ompi_status_public_t now so that this bug can be fixed without ABI breakage?

https://svn.open-mpi.org/trac/ompi/ticket/2241

Note that the 1.4.3 milestone doesn't make sense since the bug can't be fixed there without an ABI change.

Jed
Re: [OMPI users] Ok, I've got OpenMPI set up, now what?!
On Mon, 19 Jul 2010 13:33:01 -0600, Damien Hocking wrote:
> It does. The big difference is that MUMPS is a 3-minute compile, and
> PETSc, erm, isn't. It's... longer...

FWIW, PETSc takes less than 3 minutes to build (after configuration) for me (I build it every day). Building MUMPS (with dependencies) is automatic with PETSc's --download-{blacs,scalapack,mumps}, but is involved to do by hand (all three require editing makefiles). I know people that have configured PETSc just to build code which calls MUMPS directly (without PETSc). :-)

Jed
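[For reference, a configure line of the sort being described; option spellings can change between PETSc releases:]

$ ./configure --with-debugging=0 --download-blacs --download-scalapack --download-mumps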
Re: [OMPI users] openmpi v1.5?
On Mon, 19 Jul 2010 15:16:59 -0400, Michael Di Domenico wrote: > Since I am a SVN neophyte can anyone tell me when openmpi 1.5 is > scheduled for release? https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.5 > And whether the Slurm srun changes are going to make in? https://svn.open-mpi.org/trac/ompi/wiki/v1.5/planning Jed
Re: [OMPI users] Highly variable performance
On Thu, 15 Jul 2010 13:03:31 -0400, Jeff Squyres wrote:
> Given the oversubscription on the existing HT links, could contention
> account for the difference? (I have no idea how HT's contention
> management works) Meaning: if the stars line up in a given run, you
> could end up with very little/no contention and you get good
> bandwidth. But if there's a bit of jitter, you could end up with
> quite a bit of contention that ends up cascading into a bunch of
> additional delay.

What contention? Many sockets needing to access memory on another socket via HT links? Then yes, perhaps that could be a lot. As shown in the diagram, it's pretty non-uniform, and if, say, sockets 0, 1, and 3 all found memory on socket 0 (say socket 2 had local memory), then there are two ways for messages to get from 3 to 0 (via 1 or via 2). I don't know if there is hardware support to re-route to avoid contention, but if not, then socket 3 could be sharing the 1->0 HT link (which has max throughput of 8 GB/s, therefore 4 GB/s would be available per socket, provided it was still operating at peak). Note that this 4 GB/s is still less than splitting the 10.7 GB/s three ways.

> I fail to see how that could add up to 70-80 (or more) seconds of
> difference -- 13 secs vs. 90+ seconds (and more), though... 70-80
> seconds sounds like an IO delay -- perhaps paging due to the ramdisk
> or somesuch...?

That's a SWAG. This problem should have had significantly less resident memory than would cause paging, but these were very short jobs, so a relatively small amount of paging would cause a big performance hit. We have also seen up to a factor of 10 variability in longer jobs (e.g. 1 hour for a "fast" run) with larger working sets, but once the pages are faulted, this kernel (2.6.18 from RHEL5) won't migrate them around, so even if you eventually swap out all the ramdisk, pages faulted before and after will be mapped to all sorts of inconvenient places. But I don't have any systematic testing with a guaranteed clean ramdisk, and I'm not going to overanalyze the extra factors when there's an understood factor of 3 hanging in the way. I'll give an update if there is any news.

Jed
Re: [OMPI users] Highly variable performance
On Thu, 15 Jul 2010 09:36:18 -0400, Jeff Squyres wrote: > Per my other disclaimer, I'm trolling through my disastrous inbox and > finding some orphaned / never-answered emails. Sorry for the delay! No problem, I should have followed up on this with further explanation. > Just to be clear -- you're running 8 procs locally on an 8 core node, > right? These are actually 4-socket quad-core nodes, so there are 16 cores available, but we are only running on 8, -npersocket 2 -bind-to-socket. This was a greatly simplified case, but is still sufficient to show the variability. It tends to be somewhat worse if we use all cores of a node. (Cisco is an Intel partner -- I don't follow the AMD line > much) So this should all be local communication with no external > network involved, right? Yes, this was the greatly simplified case, contained entirely within a > > lsf.o240562 killed 8*a6200 > > lsf.o240563 9.2110e+01 8*a6200 > > lsf.o240564 1.5638e+01 8*a6237 > > lsf.o240565 1.3873e+01 8*a6228 > > Am I reading that right that it's 92 seconds vs. 13 seconds? Woof! Yes, an the "killed" means it wasn't done after 120 seconds. This factor of 10 is about the worst we see, but of course very surprising. > Nice and consistent, as you mentioned. And I assume your notation > here means that it's across 2 nodes. Yes, the Quadrics nodes are 2-socket dual core, so 8 procs needs two nodes. The rest of your observations are consistent with my understanding. We identified two other issues, neither of which accounts for a factor of 10, but which account for at least a factor of 3. 1. The administrators mounted a 16 GB ramdisk on /scratch, but did not ensure that it was wiped before the next task ran. So if you got a node after some job that left stinky feces there, you could effectively only have 16 GB (before the old stuff would be swapped out). More importantly, the physical pages backing the ramdisk may not be uniformly distributed across the sockets, and rather than preemptively swap out those old ramdisk pages, the kernel would find a page on some other socket (instead of locally, this could be confirmed, for example, by watching the numa_foreign and numa_miss counts with numastat). Then when you went to use that memory (typically in a bandwidth-limited application), it was easy to have 3 sockets all waiting on one bus, thus taking a factor of 3+ performance hit despite a resident set much less than 50% of the available memory. I have a rather complete analysis of this in case someone is interested. Note that this can affect programs with static or dynamic allocation (the kernel looks for local pages when you fault it, not when you allocate it), the only way I know of to circumvent the problem is to allocate memory with libnuma (e.g. numa_alloc_local) which will fail if local memory isn't available (instead of returning and subsequently faulting remote pages). 2. The memory bandwidth is 16-18% different between sockets, with sockets 0,3 being slow and sockets 1,2 having much faster available bandwidth. This is fully reproducible and acknowledged by Sun/Oracle, their response to an early inquiry: http://59A2.org/files/SunBladeX6440STREAM-20100616.pdf I am not completely happy with this explanation because the issue persists even with full software prefetch, packed SSE2, and non-temporal stores; as long as the working set does not fit within (per-socket) L3. Note that the software prefetch allows for several hundred cycles of latency, so the extra hop for snooping shouldn't be a problem. 
If the working set fits within L3, then all sockets are the same speed (and of course much faster due to improved bandwidth). Some disassembly here: http://gist.github.com/476942

The three with prefetch and movntpd run within 2% of each other; the other is much faster within cache and much slower when it breaks out of cache (obviously). The performance numbers are higher than with the reference implementation (quoted in Sun/Oracle's response), but (run with taskset on each of the four sockets):

Triad: 5842.5814 0.0329 0.0329 0.0330
Triad: 6843.4206 0.0281 0.0281 0.0282
Triad: 6827.6390 0.0282 0.0281 0.0283
Triad: 5862.0601 0.0329 0.0328 0.0331

This is almost exclusively due to the prefetching; the packed arithmetic is almost completely inconsequential when waiting on memory bandwidth.

Jed
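[A sketch of the libnuma workaround mentioned above (numa_alloc_local instead of first-touch allocation); link with -lnuma; the allocation size is illustrative:]

#include <numa.h>
#include <stdio.h>

int main(void)
{
  if (numa_available() < 0) {
    fprintf(stderr, "no NUMA support on this system\n");
    return 1;
  }
  size_t bytes = 1 << 20;
  /* Allocate pages on the local node, instead of letting the kernel
     silently fault them onto whichever node happens to have free memory. */
  double *p = numa_alloc_local(bytes);
  if (!p) {
    fprintf(stderr, "no local memory available\n");
    return 1;
  }
  /* ... use p for the bandwidth-limited kernel ... */
  numa_free(p, bytes);
  return 0;
}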
Re: [OMPI users] EXTERNAL: Re: MPI_GET beyond 2 GB displacement
On Thu, 8 Jul 2010 09:53:11 -0400, Jeff Squyres wrote:
>> Do you "use mpi" or the F77 interface?
>
> It shouldn't matter; both the Fortran module and mpif.h interfaces are the
> same.

Yes, but only the F90 version can do type checking; the function prototypes are not present in mpif.h. The truncation is an internal issue, unrelated to user code having compatible types (since I can reproduce the issue from C). I'm just confused by how passing an int64 when an int32 is expected could work from Fortran on a big-endian system (it would likely work from C when the argument is passed by value in a register); using the F90 module would enable the compiler to do type checking and should highlight any type mismatches.

Jed
Re: [OMPI users] EXTERNAL: Re: MPI_GET beyond 2 GB displacement
On Wed, 07 Jul 2010 17:34:44 -0600, "Price, Brian M (N-KCI)" wrote: > Jed, > > The IBM P5 I'm working on is big endian. Sorry, that didn't register. The displ argument is MPI_Aint which is 8 bytes (at least on LP64, probably also on LLP64), so your use of kind=8 for that is certainly correct. The count argument is a plain int, I don't see how your code could be correct if you pass in an 8-byte int there when it expects a 4-byte int (since the upper 4 bytes would be used on a big-endian system). > The test program I'm using is written in Fortran 90 (as stated in my > question). Do you "use mpi" or the F77 interface? > I imagine this is indeed a library issue, but I still don't understand what > I've done wrong here. I can reproduce this in C on x86-64, even with displ much smaller than 2^31 (e.g. by setting displ_unit=4). Apparently Open MPI multiplies displ*displ_unit and stuffs the result in an int (somewhere in the implementation), MPICH2 works correctly for me with large displacements. https://svn.open-mpi.org/trac/ompi/ticket/2472 Jed
Re: [OMPI users] EXTERNAL: Re: MPI_GET beyond 2 GB displacement
On Wed, 07 Jul 2010 15:51:41 -0600, "Price, Brian M (N-KCI)" wrote: > Jeff, > > I understand what you've said about 32-bit signed INTs, but in my program, > the displacement variable that I use for the MPI_GET call is a 64-bit INT > (KIND = 8). The MPI Fortran bindings expect a standard int, your program is only working because your system is little endian so the first 4 bytes are the low bytes (correct for numbers less than 2^31), it would be completely broken on a big endian system. This is a library issue, you can't fix it by using different sized ints in your program and you would see compiler errors due to the type mismatch if you were using Fortran 90 (which is capable of some type checking). Jed
Re: [OMPI users] Highly variable performance
Following up on this, I have partial resolution. The primary culprit appears to be stale files in a ramdisk non-uniformly distributed across the sockets, thus interacting poorly with NUMA. The slow runs invariably have high numa_miss and numa_foreign counts. I still have trouble making it explain up to a factor of 10 degradation, but it certainly explains a factor of 3. Jed
Re: [OMPI users] Address not mapped segmentation fault with 1.4.2 ...
Just a guess, but you could try the updated patch here https://svn.open-mpi.org/trac/ompi/ticket/2431 Jed
[OMPI users] Highly variable performance
I'm investigating some very large performance variation and have reduced the issue to a very simple MPI_Allreduce benchmark. The variability does not occur for serial jobs, but it does occur within single nodes. I'm not at all convinced that this is an Open MPI-specific issue (in fact the same variance is observed with MVAPICH2, which is an available, but not "recommended", implementation on that cluster) but perhaps someone here can suggest steps to track down the issue.

The nodes of interest are 4-socket Opteron 8380 (quad core, 2.5 GHz), connected with QDR InfiniBand. The benchmark loops over

MPI_Allgather(localdata,nlocal,MPI_DOUBLE,globaldata,nlocal,MPI_DOUBLE,MPI_COMM_WORLD);

with nlocal=1 (80 KiB messages) 1 times, so it normally runs in a few seconds. Open MPI 1.4.1 was compiled with gcc-4.3.3, and this code was built with mpicc -O2. All submissions were 8 processes; timing and host results are presented below in chronological order. The jobs were run with 2-minute time limits (to get through the queue easily); jobs are marked "killed" if they went over this amount of time. Jobs were usually submitted in batches of 4. The scheduler is LSF-7.0. The HOST field indicates the node that was actually used; a6* nodes are of the type described above, a2* nodes are much older (2-socket Opteron 2220 (dual core, 2.8 GHz)) and use a Quadrics network; the timings are very reliable on these older nodes.

When the issue first came up, I was inclined to blame memory bandwidth issues with other jobs, but the variance is still visible when our job runs on exactly a full node, is present regardless of affinity settings, and events that don't require communication are well-balanced in both small and large runs. I then suspected possible contention between transport layers; ompi_info gives

MCA btl: parameter "btl" (current value: "self,sm,openib,tcp", data source: environment)

so the timings below show many variations of restricting these values. Unfortunately, the variance is large for all combinations, but I find it notable that -mca btl self,openib is reliably much slower than self,tcp. Note that some nodes are used in multiple runs, yet there is no strict relationship where some nodes are "fast"; for instance, a6200 is very slow (6x and more) in the first set, then normal in the subsequent test. Nevertheless, when the same node appears in temporally nearby tests, there seems to be a correlation (though there is certainly not enough data here to establish that with confidence).

As a final observation, I think the performance in all cases is unreasonably low since the same test on a (unrelated to the cluster) 2-socket Opteron 2356 (quad core, 2.3 GHz) always takes between 9.75 and 10.0 seconds, 30% faster than the fastest observations on the cluster nodes with faster cores and memory.
# JOB         TIME (s)     HOST
ompirun
lsf.o240562   killed       8*a6200
lsf.o240563   9.2110e+01   8*a6200
lsf.o240564   1.5638e+01   8*a6237
lsf.o240565   1.3873e+01   8*a6228
ompirun -mca btl self,sm
lsf.o240574   1.6916e+01   8*a6237
lsf.o240575   1.7456e+01   8*a6200
lsf.o240576   1.4183e+01   8*a6161
lsf.o240577   1.3254e+01   8*a6203
lsf.o240578   1.8848e+01   8*a6274
prun (quadrics)
lsf.o240602   1.6168e+01   4*a2108+4*a2109
lsf.o240603   1.6746e+01   4*a2110+4*a2111
lsf.o240604   1.6371e+01   4*a2108+4*a2109
lsf.o240606   1.6867e+01   4*a2110+4*a2111
ompirun -mca btl self,openib
lsf.o240776   3.1463e+01   8*a6203
lsf.o240777   3.0418e+01   8*a6264
lsf.o240778   3.1394e+01   8*a6203
lsf.o240779   3.5111e+01   8*a6274
ompirun -mca self,sm,openib
lsf.o240851   1.3848e+01   8*a6244
lsf.o240852   1.7362e+01   8*a6237
lsf.o240854   1.3266e+01   8*a6204
lsf.o240855   1.3423e+01   8*a6276
ompirun
lsf.o240858   1.4415e+01   8*a6244
lsf.o240859   1.5092e+01   8*a6237
lsf.o240860   1.3940e+01   8*a6204
lsf.o240861   1.5521e+01   8*a6276
lsf.o240903   1.3273e+01   8*a6234
lsf.o240904   1.6700e+01   8*a6206
lsf.o240905   1.4636e+01   8*a6269
lsf.o240906   1.5056e+01   8*a6234
ompirun -mca self,tcp
lsf.o240948   1.8504e+01   8*a6234
lsf.o240949   1.9317e+01   8*a6207
lsf.o240950   1.8964e+01   8*a6234
lsf.o240951   2.0764e+01   8*a6207
ompirun -mca btl self,sm,openib
lsf.o240998   1.3265e+01   8*a6269
lsf.o240999   1.2884e+01   8*a6269
lsf.o241000   1.3092e+01   8*a6234
lsf.o241001   1.3044e+01   8*a6269
ompirun -mca btl self,openib
lsf.o241013   3.1572e+01   8*a6229
lsf.o241014   3.0552e+01   8*a6234
lsf.o241015   3.1813e+01   8*a6229
lsf.o241016   3.2514e+01   8*a6252
ompirun -mca btl self,sm
lsf.o241044   1.3417e+01   8*a6234
lsf.o241045   killed       8*a6232
lsf.o241046   1.4626e+01   8*a6269
lsf.o241047   1.5060e+01   8*a6253
lsf.o241166   1.3179e+01   8*a6228
lsf.o241167   2.7759e+01   8*a6232
lsf.o241168   1.4224e+01   8*a6234
lsf.o241169   1.4825e+01   8*a6228
lsf.o241446   1.4896e+01   8*a6204
lsf.o241447   1.4960e+01   8*a6228
lsf.o241448   1.7622e+01   8*a6222
lsf.o241449   1.5112e+01   8*a6204
ompirun -mca btl self,tcp
lsf.o241556   1.9135e+01   8*a6204
lsf.o241557   2.4365e+01   8*a6261
lsf.o241558
Re: [OMPI users] Solving SVD Using Lanczos Method Implementation
On Mon, 26 Apr 2010 22:30:15 +0700, long thai wrote: > Hi all. > > I'm trying to develop MPI program to solve SVD using Lanczos algorithms. > However, I have no idea how to do that. Somebody suggested to take a look at > http://www.netlib.org/scalapack/ but I cannot understand exactly what to > look. Morever, I know that *las2* is the popular library to solve SVD but > don't know how to use it in parallel computing. I recommend SLEPc http://www.grycap.upv.es/slepc/ There are plenty of examples and a variety of algorithms for scalable computation of SVDs. Jed
Re: [OMPI users] How to "guess" the incoming data type ?
On Sun, 25 Apr 2010 20:38:54 -0700, Eugene Loh wrote:
> Could you encode it into the tag?

This sounds dangerous.

> Or, append a data type to the front of each message?

This is the idea; unfortunately, this still requires multiple messages for collectives (because you can't probe for a suitable buffer size, and no dynamic language will be happy with "the buffer can be anything as long as its type is in this list and the total number of bytes is N"). This file is a pretty easy-to-read reference for a friendly MPI in a dynamic language:

http://mpi4py.googlecode.com/svn/trunk/src/MPI/pickled.pxi

Note that mpi4py also exposes the low-level functionality. Numpy arrays can be sent without pickling:

http://mpi4py.googlecode.com/svn/trunk/src/MPI/asbuffer.pxi

Something that could be done to prevent packing in some cases is to define an MPI datatype for the send and receive types, but this will usually require an extra message because the receiver has to wire up an empty object that is ready to receive the incoming message.

Jed
Re: [OMPI users] How to debug Open MPI programs with gdb
On Thu, 22 Apr 2010 13:11:49 +0200, "Немања Илић (Nemanja Ilic)" wrote:
> On the contrary when I debug with "mpirun -np 4 xterm -e gdb
> my_mpi_application" the four debugger windows are started with a
> separate thread each, just as it should be. Since I will be using the
> debugger on a remote computer I can only run gdb in console mode. Can
> anyone help me with this?

An alternative to opening xterms (e.g. if that host isn't running an X server, you can't get X11 forwarding to work, or you just don't want xterms) is to use GNU "screen". It's basically the same command line, but it will open a screen terminal for each thread. When debugging multiple threads with xterms or screens, I recommend gdb's

-ex 'break somewhere' -ex run --args ./app -args -for -your application

to save you from entering commands into each terminal separately.

Jed
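[For example, with a hypothetical program name, arguments, and breakpoint:]

$ mpirun -np 4 xterm -e gdb -ex 'break main' -ex run --args ./my_app -my -args

Each xterm (or screen) then stops at the breakpoint without any commands being typed by hand.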
Re: [OMPI users] 3D domain decomposition with MPI
On Fri, 12 Mar 2010 15:06:33 -0500, Gus Correa wrote:
> Hi Cole, Jed
>
> I don't have much direct experience with PETSc.

Disclaimer: I've been using PETSc for several years and also work on the library itself.

> I mostly troubleshooted other people's PETSc programs,
> and observed their performance.
> What I noticed is:
> 1) PETSc's learning curve is as steep if not steeper than MPI,

and I think this depends strongly on what you want to do. Since the library is built on top of MPI, it's sort of trivially true, since it's beneficial for the user to be familiar with collective semantics, and perhaps other MPI functionality, depending on the level of control that they seek. That said, many PETSc users never call MPI directly.

> 2) PETSc codes seem to be slower (or have more overhead)
> than codes written directly in MPI.
> Jed seems to have a different perception of PETSc, though,
> and is more enthusiastic about it.
>
> Admittedly, I don't have any direct comparison
> (i.e. the same exact code implemented via PETSc and via MPI),
> to support what I said above.

If you do find such a comparison, we'd like to see it. We expose a limited number of interfaces that are known to perform/scale poorly, because users who are not concerned about scalability ask for them so often. These should be clearly marked; we'll fix the docs if this is not the case. Note that PETSc's neighbor updates use persistent nonblocking calls by default, but you can select alltoallw, one-sided, ready-send, synchronous sends, and a couple other options, with and without packing (choice at runtime). If you know of a faster way, we'd like to see it. Note that a default build is in debugging mode, which activates lots of integrity checks, checks for memory corruption, etc., and is usually 2 or 3 times slower than a production build (--with-debugging=0).

> OTOH, if you have a clean and good serial code already developed,
> I think it won't be a big deal to parallelize it directly
> with MPI, assuming that the core algorithm (your Gauss-Seidel solver)
> fits the remaining code in a well structured way.

This depends a lot on the structure of the serial code. Bill Gropp had a great quote in the last rce-cast (starts at 38:30, in response to Brock Palen's question about what to think about when designing a parallel program):

  I think the first thing they should keep in mind is to see whether they can succeed without using MPI. After all, one of the things that we try to do with MPI is to encourage the development of libraries. All too often we see people who are reinventing "PETSc-light" instead of just pulling up the library and using it. MPI enabled an entire parallel ecosystem for scientific software and the first thing you should do is see if you've already had someone else do the job for you.

  I think after that, if you actually have to write the code, then you have to confront the top-down versus bottom-up. And the next mistake that people make is they write the individual node code and then try to figure out how to glue it together to all of the other nodes. And we really feel that for many applications, what you want to do is to start by viewing your application as a global application, have global data structures, figure out how you decompose it, and then the code to coordinate the communication between them will be pretty obvious. And you can tell the difference between how an application was built, from whether it was top-down or bottom-up.

  [...]
  You want to think about how you decompose your data structures, how you think about them globally. Rather than saying, I need to think about everything in parallel, so I'll have all these little patches, and I'll compute on them, and then figure out how to stitch them together. If you were building a house, you'd start with a set of blueprints that give you a picture of what the whole house looks like. You wouldn't start with a bunch of tiles and say, "Well I'll put this tile down on the ground, and then I'll find a tile to go next to it." But all too many people try to build their parallel programs by creating the smallest possible tiles and then trying to have the structure of their code emerge from the chaos of all these little pieces. You have to have an organizing principle if you're going to survive making your code parallel.

Jed
Re: [OMPI users] 3D domain decomposition with MPI
On Thu, 11 Mar 2010 12:44:01 -0500, "Cole, Derek E" wrote:
> I am replying to this via the daily-digest message I got. Sorry it
> wasn't sooner... I didn't realize I was getting replies until I got
> the digest. Does anyone know how to change it so I get the emails as
> you all send them?

Log in at the bottom and edit options: http://www.open-mpi.org/mailman/listinfo.cgi/users

> I am doing a Red-black Gauss-Seidel algorithm.

Note that red-black Gauss-Seidel is a terrible algorithm on cache-based hardware; it only makes sense on vector hardware. The reason for this is that the whole point is to approximate a dense action (the inverse of a matrix), but the red-black ordering causes this action to be purely local. A sequential ordering, on the other hand, is like a dense lower-triangular operation, which tends to be much closer to a real inverse. In parallel, you do these sequential sweeps on each process, and communicate when you're done. The memory performance will be twice as good, and the algorithm will converge in fewer iterations.

> The ghost points are what I was trying to figure for moving this into
> the 3rd dimension. Thanks for adding some concrete-ness to my idea of
> exactly how much overhead is involved. The test domains I am computing
> on are on the order of 100*50*50 or so. This is why I am trying to
> limit the overhead of the MPI communication. I am in the process of
> finding out exactly how big the domains may become, so that I can
> adjust the algorithm accordingly.

The difficulty is for small subdomains. If you have large subdomains, then parallel overhead will almost always be small.

> If I understand what you mean by pencils versus books, I don't think
> that will work for these types of calculations will it? Maybe a better
> explanation of what you mean by a pencil versus a book. If I was going
> to solve a sub-matrix of the XY plane for all Z-values, what is that
> considered?

That would be a "book" or "slab". I still recommend using PETSc rather than reproducing standard code to call MPI directly for this; it will handle all the decomposition and updates, and has the advantage that you'll be able to use much better algorithms than Gauss-Seidel.

Jed
Re: [OMPI users] 3D domain decomposition with MPI
On Wed, 10 Mar 2010 22:25:43 -0500, Gus Correa wrote: > Ocean dynamics equations, at least in the codes I've seen, > normally use "pencil" decomposition, and are probably harder to > handle using 3D "chunk" decomposition (due to the asymmetry imposed by > gravity). There is also a lot to be said for the strength of coupling. Ocean codes do "tridiagonal solves" in columns, and these would no longer be trivially cheap (in fact the structure of the code would need to change) if the partition also broke up the vertical. Since the domain is so anisotropic and the coupling (at least aside from the barotropic mode) is so much stronger in the vertical than the horizontal, it is good to decompose with columns always kept together. In a fully implicit code, these column solves would quit being mandatory, but the availability of a "line smoother" for multigrid still favors this type of decomposition. Also note that in domain decomposition algorithms (like additive Schwarz, and balancing Neumann-Neumann), the asymptotics scale with the maximum number of subdomains required to cross the domain, and/or with the number of elements along the longest edge of a subdomain. This tends to favor partitioning in 3D, unless the physics/domain is sufficiently anisotropic to overcome this preference. Also depending on Derek's application, he may want to use a library like PETSc to handle the decomposition and updates. Certainly this is true if the application may ever need solvers; in my opinion, it is also true unless this is a toy problem being used to learn MPI. If you really want to write this stuff yourself, it's still worth looking at the discussion in PETSc user's manual. Jed
Re: [OMPI users] MPi Abort verbosity
On Wed, 24 Feb 2010 14:21:02 +0100, Gabriele Fatigati wrote: > Yes, of course, > > but i would like to know if there is any way to do that with openmpi See the error handler docs, e.g. MPI_Comm_set_errhandler. Jed
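[A minimal sketch of the run-time approach: with MPI_ERRORS_RETURN installed, errors come back as return codes that the caller can report before deciding whether to abort. The deliberately invalid destination rank is only there to trigger an error.]

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  char msg[MPI_MAX_ERROR_STRING];
  int len, err, dummy = 0;

  MPI_Init(&argc, &argv);
  /* Replace the default MPI_ERRORS_ARE_FATAL handler */
  MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
  err = MPI_Send(&dummy, 1, MPI_INT, 12345678, 0, MPI_COMM_WORLD);
  if (err != MPI_SUCCESS) {
    MPI_Error_string(err, msg, &len);
    fprintf(stderr, "MPI error: %s\n", msg);
    /* decide here whether to recover or call MPI_Abort() */
  }
  MPI_Finalize();
  return 0;
}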
Re: [OMPI users] Similar question about MPI_Create_type
On Mon, 08 Feb 2010 14:42:15 -0500, Prentice Bisbal wrote:
> I'll give that a try, too. IMHO, MPI_Pack/Unpack looks easier and less
> error prone, but Pacheco advocates using derived types over
> MPI_Pack/Unpack.

I would recommend using derived types for big structures, or perhaps for long-lived medium-sized structures. If your structure is static (i.e. doesn't contain pointers), then derived types definitely make sense and allow you to use that type in collectives.

> In my situation, rank 0 is reading in a file containing all the coords.
> So even if other ranks don't have the data, I still need to create the
> structure on all the nodes, even if I don't populate it with data?

You're populating it by receiving data. MPI can't allocate the space for you, so you have to set it up.

> To clarify: I thought adding a similar structure, b_point in rank 1
> would be adequate to receive the data from rank 0

You have allocated memory by the time you call MPI_Recv, but you were passing an undefined value to MPI_Address, and you certainly can't base derived_type on a_point and use it to receive into b_point. It would be fine to receive into a_point on rank 1, but whatever you do, derived_type has to be created correctly on each process.

Jed
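[For the variable-size case, a sketch of the MPI_BOTTOM approach mentioned earlier in this thread: the type is built per-process from absolute addresses, so the receiver must allocate coords before building its own type. Here p is a point, ncoords/dest/tag/comm are placeholders, and MPI_Get_address is the non-deprecated spelling of MPI_Address.]

MPI_Aint disp[2];
int blens[2] = {1, ncoords};
MPI_Datatype types[2] = {MPI_INT, MPI_INT}, ptype;

MPI_Get_address(&p.index, &disp[0]);
MPI_Get_address(p.coords, &disp[1]);   /* p.coords must already be allocated */
MPI_Type_create_struct(2, blens, disp, types, &ptype);
MPI_Type_commit(&ptype);
/* absolute addresses, so the buffer argument is MPI_BOTTOM */
MPI_Send(MPI_BOTTOM, 1, ptype, dest, tag, comm);
MPI_Type_free(&ptype);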
Re: [OMPI users] Similar question about MPI_Create_type
On Mon, 08 Feb 2010 13:54:10 -0500, Prentice Bisbal wrote: > but I don't have that book handy The standard has lots of examples. http://www.mpi-forum.org/docs/docs.html You can do this, but for small structures, you're better off just packing buffers. For large structures containing variable-size fields, I think it is clearer to use MPI_BOTTOM instead of offsets from an arbitrary (instance-dependent) address. [...] > if (rank == 0) { > a_point.index = 1; > a_point.coords = malloc(3 * sizeof(int)); > a_point.coords[0] = 3; > a_point.coords[1] = 6; > a_point.coords[2] = 9; > } > > block_lengths[0] = 1; > block_lengths[1] = 3; > > type_list[0] = MPI_INT; > type_list[1] = MPI_INT; > > displacements[0] = 0; > MPI_Address(&a_point.index, &start_address); > MPI_Address(a_point.coords, &address); ^^ Rank 1 has not allocated this yet. Jed
Re: [OMPI users] Difficulty with MPI_Unpack
On Sun, 07 Feb 2010 22:40:55 -0500, Prentice Bisbal wrote:
> Hello, everyone. I'm having trouble packing/unpacking this structure:
>
> typedef struct{
>   int index;
>   int* coords;
> }point;
>
> The size of the coords array is not known a priori, so it needs to be a
> dynamic array. I'm trying to send it from one node to another using
> MPI_Pack/MPI_Unpack as shown below. When I unpack it, I get this error
> when unpacking the coords array:
>
> [fatboy:07360] *** Process received signal ***
> [fatboy:07360] Signal: Segmentation fault (11)
> [fatboy:07360] Signal code: Address not mapped (1)
> [fatboy:07360] Failing at address: (nil)

Looks like b_point.coords = NULL.  Has this been allocated on rank 1?  You might need to use MPI_Get_count to decide how much to allocate.  Also, if you don't have a convenient upper bound on the size of the receive buffer, you can use MPI_Probe followed by MPI_Get_count to determine the size before calling MPI_Recv.

Jed
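P.S. The probe-then-allocate pattern, as a sketch (the helper name is made up):

  #include <mpi.h>
  #include <stdlib.h>

  /* Receive a packed message of unknown length: probe, size the buffer
     with MPI_Get_count, allocate, then receive. */
  char *recv_unknown_length(int src, int tag, MPI_Comm comm, int *nbytes)
  {
    MPI_Status status;
    char *buf;
    MPI_Probe(src, tag, comm, &status);          /* blocks until matched */
    MPI_Get_count(&status, MPI_PACKED, nbytes);  /* bytes in the message */
    buf = malloc(*nbytes);
    MPI_Recv(buf, *nbytes, MPI_PACKED, src, tag, comm, &status);
    return buf;
  }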
Re: [OMPI users] [mpich-discuss] problem with MPI_Get_count() for very long (but legal length) messages.
On Fri, 5 Feb 2010 14:28:40 -0600, Barry Smith wrote:
> To cheer you up, when I run with openMPI it runs forever sucking down
> 100% CPU trying to send the messages :-)

On my test box (x86 with 8GB memory), Open MPI (1.4.1) does complete after several seconds, but still prints the wrong count.  MPICH2 does not actually send the message, as you can see by running the attached code.

  # Open MPI 1.4.1, correct cols[0]
  [0] sending...
  [1] receiving...
  count -103432106, cols[0] 0

  # MPICH2 1.2.1, incorrect cols[0]
  [1] receiving...
  [0] sending...
  [1] count -103432106, cols[0] 1

How much memory does crush have (you need about 7GB to do this without swapping)?  In particular, most of the time it took Open MPI to send the message (with your source) was actually just spent faulting the send/recv buffers.  The attached code faults the buffers first, and the subsequent send/recv takes less than 2 seconds.  Actually, it's clear that MPICH2 never touches either buffer because it returns immediately regardless of whether they have been faulted first.

Jed

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc,char **argv)
  {
    int ierr,i,size,rank;
    int cnt = 433438806;
    MPI_Status status;
    long long *cols;

    MPI_Init(&argc,&argv);
    ierr = MPI_Comm_size(MPI_COMM_WORLD,&size);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    if (size != 2) {
      fprintf(stderr,"[%d] usage: mpiexec -n 2 %s\n",rank,argv[0]);
      MPI_Abort(MPI_COMM_WORLD,1);
    }
    cols = malloc(cnt*sizeof(long long));
    /* (the tail of the attachment was truncated in the archive; the body
       below is reconstructed to match the output quoted above: fault the
       buffer, then send/recv and print the count) */
    for (i=0; i<cnt; i++) cols[i] = rank;   /* fault the buffer first */
    if (rank == 0) {
      printf("[%d] sending...\n",rank);
      ierr = MPI_Send(cols,cnt,MPI_LONG_LONG_INT,1,0,MPI_COMM_WORLD);
    } else {
      int count;
      printf("[%d] receiving...\n",rank);
      ierr = MPI_Recv(cols,cnt,MPI_LONG_LONG_INT,0,0,MPI_COMM_WORLD,&status);
      ierr = MPI_Get_count(&status,MPI_LONG_LONG_INT,&count);
      printf("[%d] count %d, cols[0] %lld\n",rank,count,cols[0]);
    }
    free(cols);
    ierr = MPI_Finalize();
    return 0;
  }
Re: [OMPI users] speed up this problem by MPI
On Fri, 29 Jan 2010 11:25:09 -0500, Richard Treumann wrote:
> Any support for automatic serialization of C++ objects would need to be in
> some sophisticated utility that is not part of MPI. There may be such
> utilities but I do not think anyone who has been involved in the discussion
> knows of one you can use. I certainly do not.

C++ really doesn't offer sufficient type introspection to implement something like this.  Boost.MPI offers serialization for a few types (e.g. some STL containers), but the general solution that you would like just doesn't exist (you'd have to write special code for every type you want to be able to operate on).

Python can do things like this: mpi4py can operate transparently on any (pickleable) object, and also offers complete bindings to the low-level MPI interface.  CL-MPI (Common Lisp) can also do these things, but it's much less mature than mpi4py.

Jed
Re: [OMPI users] ABI stabilization/versioning
On Tue, 26 Jan 2010 11:15:45 +0000, Dave Love wrote:
> > Versions were bumped to 0.0.1 for libmpi which has no
> > effect for dynamic linking.
>
> I've forgotten the rules on this, but the point is that it needs to
> affect dynamic linking to avoid running with earlier libraries
> (specifically picking up ones from 1.2, which is the most common
> problem).

Dave, I think you are correct that this has not actually been done.  In particular, I have 1.4.1 installed, but the soname is still libmpi.so.0.  It's irrelevant that the symbolic links are set up for libmpi.so.0.0.1; this minor versioning only affects which DSO gets used when the linker (not the loader) sees -lmpi.  And inspecting a binary built in Sep 2008 (must have been 1.2.7), ldd resolves it to my 1.4.1 copy without complaints.  However, the loader is intelligent and at least offers a warning when I try to run this ancient binary:

  ./a.out: Symbol `ompi_mpi_comm_null' has different size in shared object, consider re-linking

Chapter 3 of this paper was useful to me when learning about ABI versioning.

  http://people.redhat.com/drepper/dsohowto.pdf

Jed
Re: [OMPI users] ABI stabilization/versioning
On Mon, 25 Jan 2010 15:10:12 -0500, Jeff Squyres wrote:
> Indeed. Our wrapper compilers currently explicitly list all 3
> libraries (-lmpi -lopen-rte -lopen-pal) because we don't know if those
> libraries will be static or shared at link time.

I am suggesting that it is unavoidable for the person doing the linking to be explicit about whether they want static or dynamic libs when they invoke mpicc.  Consider the pkg-config model, where you might write

  gcc -static -o my-app main.o `pkg-config --libs --static openmpi fftw3`
  gcc -o my-app main.o `pkg-config --libs openmpi fftw3`

In MPI world,

  gcc -static -o my-app main.o `mpicc -showme:link-static` `pkg-config --libs --static fftw3`
  gcc -o my-app main.o `mpicc -showme:link` `pkg-config --libs fftw3`

seems tolerable.  The trick (as you point out) is to get the option processed when the wrapper is being invoked as the compiler instead of just for the -showme options.  Possible options are defining an OMPI_STATIC environment variable or inspecting argv for --link:static (or some such).  This is one of the many reasons why wrappers are a horrible solution, especially when they are expected to be used in nontrivial cases.  Ideally, the adopted plan could be done in some coordination with MPICH2 (which lacks a -showme:link analogue) so that it is not so hard to write portable build systems.

> > On the cited bug report, I just wanted to note that collapsing
> > libopen-rte and libopen-pal (even only in production builds) has the
> > undesirable effect that their ABI cannot change without incrementing
> > the soname of libmpi (i.e. user binaries are coupled just as tightly
> > to these libraries as when they were separate but linked explicitly,
> > so this offers no benefit at all).
>
> Indeed -- this is exactly the reason we ended up leaving libopen-* .so
> versions at 0:0:0.

But not versioning those libs isn't much of a solution either, since it becomes possible to get an ABI mismatch at runtime (consider someone who uses them independently, or if they are packaged separately as in a distribution, so that it becomes possible to update these out from underneath libmpi).

> There's an additional variable -- we had considered collapsing all 3
> libraries into libmpi for production builds,

My point was that this is no solution at all since you have to bump the soname any time you change libopen-*.  So even users who NEVER call into libopen-* have to relink any time something happens there, despite their interface not changing.  And that is exactly the situation if the wrappers continue to overlink AND libopen-* became versioned, so at least by keeping them separate, you give users the option of not overlinking (albeit manually) and the option of using libopen-* without libmpi.

> Yuck.

It's 2010 and we still don't have a standard way to represent link dependencies (pkg-config might be the closest thing, but it's bad if you have multiple versions of the same library, and the granularity is wrong, e.g. if you want to link some exotic lib statically and the common ones dynamically).

Jed
Re: [OMPI users] ABI stabilization/versioning
On Mon, 25 Jan 2010 09:09:47 -0500, Jeff Squyres wrote:
> The short version is that the possibility of static linking really
> fouls up the scheme, and we haven't figured out a good way around this
> yet. :-(

So pkg-config addresses this with its Libs.private field and an explicit command-line argument when you want static libs, e.g.

  $ pkg-config --libs libavcodec
  -lavcodec
  $ pkg-config --libs --static libavcodec
  -pthread -lavcodec -lz -lbz2 -lfaac -lfaad -lmp3lame -lopencore-amrnb -lopencore-amrwb -ltheoraenc -ltheoradec -lvorbisenc -lvorbis -logg -lx264 -lm -lxvidcore -ldl -lasound -lavutil

There is no way to simultaneously (a) prevent overlinking shared libs and (b) correctly link static libs without an explicit statement from the user about whether to link *your library* statically or dynamically.  Unfortunately, pkg-config doesn't work well with multiple builds of a package, and doesn't know how to link some libs statically and some dynamically.

On the cited bug report, I just wanted to note that collapsing libopen-rte and libopen-pal (even only in production builds) has the undesirable effect that their ABI cannot change without incrementing the soname of libmpi (i.e. user binaries are coupled just as tightly to these libraries as when they were separate but linked explicitly, so this offers no benefit at all).

Jed
Re: [OMPI users] More NetBSD fixes
On Thu, 14 Jan 2010 21:55:06 -0500, Jeff Squyres wrote:
> That being said, you could sign up on it and then set your membership to
> receive no mail...?

This is especially dangerous because the Open MPI lists munge the Reply-To header, which is a bad thing:

  http://www.unicom.com/pw/reply-to-harmful.html

But lots of mailers have poor default handling of mailing lists, so it's complicated.  With munging, a mailer's "reply-to-sender" function will send mail *only* to the list, and "reply-to-all" will send it to the list and any other recipients, but *not* the sender (unless the mailer does special detection of munged Reply-To headers).  This makes it rather difficult to participate in a discussion without receiving mail from the list, or even to reliably filter list traffic (you have to write filter rules that walk the References tree to find whether it is something that would be interesting to you, and then you get false positives from people who reply to an existing thread when they wanted to start a new one).

Jed
Re: [OMPI users] MPI debugger
On Sun, 10 Jan 2010 19:29:18 +0000, Ashley Pittman wrote:
> It'll show you parallel stack traces but won't let you single step for
> example.

Two lightweight options if you want stepping, breakpoints, watchpoints, etc.

* Use serial debuggers on some interesting processes, for example with

    mpiexec -n 1 xterm -e gdb --args ./trouble args : -n 2 ./trouble args : -n 1 xterm -e gdb --args ./trouble args

  to put an xterm on ranks 0 and 3 of a four-process job (there are lots of other ways to get here).

* MPICH2 has a poor-man's parallel debugger: mpiexec.mpd -gdb allows you to send the same gdb commands to each process and collate the output.

Jed
Re: [OMPI users] Wrappers should put include path *after* user args
On Fri, 4 Dec 2009 16:20:23 -0500, Jeff Squyres wrote:
> Oy -- more specifically, we should not be putting -I/usr/include on
> the command line *at all* (because it's special and already included
> by the compiler search paths; similar for /usr/lib and /usr/lib64).

If I remember correctly, the issue was that some versions of gfortran were not searching /usr/include for mpif.h.

> Can you send the contents of your
> $prefix/share/openmpi/mpif90-wrapper-data.txt?

Attached.

Jed

# There can be multiple blocks of configuration data, chosen by
# compiler flags (using the compiler_args key to chose which block
# should be activated.  This can be useful for multilib builds.  See the
# multilib page at:
#    https://svn.open-mpi.org/trac/ompi/wiki/compilerwrapper3264
# for more information.

project=Open MPI
project_short=OMPI
version=1.3.4
language=Fortran 90
compiler_env=FC
compiler_flags_env=FCFLAGS
compiler=/usr/bin/gfortran
module_option=-I
extra_includes=
preprocessor_flags=
compiler_flags=-pthread
linker_flags=
libs=-lmpi_f90 -lmpi_f77 -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl
required_file=
includedir=${includedir}
libdir=${libdir}
[OMPI users] Wrappers should put include path *after* user args
Open MPI is installed by the distro with headers in /usr/include

  $ mpif90 -showme:compile -I/some/special/path
  -I/usr/include -pthread -I/usr/lib/openmpi -I/some/special/path

Here's why it's a problem: HDF5 is also installed in /usr with modules at /usr/include/h5*.mod.  A new HDF5 cannot be compiled using the wrappers because it will always resolve the USE statements to /usr/include, which is binary-incompatible with the new version (at a minimum, they "fixed" the size of an argument to H5Lget_info_f between 1.8.3 and 1.8.4).  To build the library, the current choices are

  (a) get rid of the system copy before building
  (b) not use the mpif90 wrapper

I just checked that MPICH2 wrappers consistently put command-line args before the wrapper args.

Jed
Re: [OMPI users] Program deadlocks, on simple send/recv loop
On Thu, 3 Dec 2009 12:21:50 -0500, Jeff Squyres wrote:
> On Dec 3, 2009, at 10:56 AM, Brock Palen wrote:
>
> > The allocation statement is ok:
> > allocate(vec(vec_size,vec_per_proc*(size-1)))
> >
> > This allocates memory vec(32768, 2350)

It's easier to translate to C rather than trying to read Fortran directly.

  #define M 2350
  #define N 32768
  complex double vec[M*N];

> This means that in the first iteration, you're calling:
>
> irank = 1
> ivec = 1
> vec_ind = (47 - 1) * 50 + 1 = 2301
> call MPI_RECV(vec(1, 2301), 32768, ...)

  MPI_Recv(&vec[2300*N],N,...);

> And in the last iteration, you're calling:
>
> irank = 47
> ivec = 50
> vec_ind = (47 - 1) * 50 + 50 = 2350
> call MPI_RECV(vec(1, 2350), 32768, ...)

  MPI_Recv(&vec[2349*N],N,...);

> That doesn't seem right.

It should be one non-overlapping column (C row) at a time.  It will be contiguous in memory, but this isn't using that property.

Jed
Re: [OMPI users] segmentation fault: Address not mapped
On Mon, 23 Nov 2009 10:39:28 -0800, George Bosilca wrote:
> In the case of Open MPI we use pointers, which are different than int
> on most cases

I just want to comment that Open MPI's opaque (to the user) pointers are significantly better than int because they offer type safety.  That is, the compiler can distinguish between MPI_Comm, MPI_Group, MPI_Status, MPI_Op, etc., and warn you if you mix them up.  When they are all typedef'd to int, you get no such warnings, and instead just get runtime errors/crashes.

> (btw int is what MPICH is using I think).

It is.

Jed
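P.S. A small illustration; the erroneous call is left commented out.  With distinct pointer-based handle types it is a compile-time diagnostic, while with int typedefs it compiles silently and fails at run time.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
    int size;
    MPI_Group group;
    MPI_Init(&argc, &argv);
    MPI_Comm_group(MPI_COMM_WORLD, &group);
    /* MPI_Comm_size(group, &size);  <- wrong handle type */
    MPI_Group_size(group, &size);    /* the call that was meant */
    MPI_Group_free(&group);
    MPI_Finalize();
    return 0;
  }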
Re: [OMPI users] memchecker overhead?
Jeff Squyres wrote:
> Verbs and Open MPI don't have these options on by default because a)
> you need to compile against Valgrind's header files to get them to
> work, and b) there's a tiny/small amount of overhead inserted by OMPI
> telling Valgrind "this memory region is ok", but we live in an
> intensely competitive HPC environment.

It's certainly competitive, but we spend most of our implementation time getting things correct rather than tuning.  The huge speed benefits come from algorithmic advances, and finding bugs quickly makes the implementation of new algorithms easier.  I'm not arguing that it should be on by default, but it's helpful to have an environment where the lower-level libs are valgrind-clean.  These days, I usually revert to MPICH when hunting something with valgrind, but use OMPI most other times.

> The option to enable this Valgrind Goodness in OMPI is --with-valgrind.
> I *think* the option may be the same for libibverbs, but I don't
> remember offhand.

I see plenty of warnings over the sm btl.  Several variations, including the excessive

  --enable-debug --enable-mem-debug --enable-mem-profile \
  --enable-memchecker --with-valgrind=/usr

were not sufficient.  (I think everything in this line except --with-valgrind increases the number of warnings, but it's nontrivial with plain --with-valgrind.)

Thanks,
Jed
Re: [OMPI users] memchecker overhead?
Samuel K. Gutierrez wrote:
> Hi Jed,
>
> I'm not sure if this will help, but it's worth a try.  Turn off OMPI's
> memory wrapper and see what happens.
>
> c-like shell
> setenv OMPI_MCA_memory_ptmalloc2_disable 1
>
> bash-like shell
> export OMPI_MCA_memory_ptmalloc2_disable=1
>
> Also add the following MCA parameter to your run command.
>
> --mca mpi_leave_pinned 0

Thanks for the tip, but these make very little difference.

Jed
Re: [OMPI users] memchecker overhead?
Jeff Squyres wrote:
> Using --enable-debug adds in a whole pile of developer-level run-time
> checking and whatnot.  You probably don't want that on production runs.

I have found that --enable-debug --enable-memchecker actually produces more valgrind noise than leaving them off.  Are there options to make Open MPI strict about initializing and freeing memory?  At one point I tried to write suppression files, but even with judicious globbing, I kept getting different warnings when run on a different program.  (All these codes were squeaky-clean under MPICH2.)

Jed
Re: [OMPI users] Question about OpenMPI performance vs. MVAPICH2
Brian Powell wrote:
> I ran a final test which I find very strange: I ran the same test case
> on 1 cpu.  The MVAPICH2 case was 23% faster!?!?  This makes little sense
> to me.  Both are using ifort as the mpif90 compiler using *identical*
> optimization flags, etc.  I don't understand how the results could be
> different.

Are you saying the output of mpicc/mpif90 -show has the same optimization flags?  MPICH2 usually puts its own optimization flags into the wrappers.

Jed
[OMPI users] MPI_Barrier called late within ompi_mpi_finalize when MPIIO fd not closed
This helped me track down a leaked file descriptor, but I think the order of events is not desirable.  If an MPIIO file descriptor is not closed before MPI_Finalize, I get the following.

  *** An error occurred in MPI_Barrier
  *** after MPI was finalized
  *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
  [brakk:1193] Abort after MPI_FINALIZE completed successfully; not able to guarantee that all other processes were killed!

  [Switching to Thread 0x7fa523b78710 (LWP 1193)]
  Breakpoint 2, 0x7fa51ed39a20 in exit () from /lib/libc.so.6
  (gdb) bt
  #0  0x7fa51ed39a20 in exit () from /lib/libc.so.6
  #1  0x7fa520ff6613 in ompi_mpi_abort () from /usr/lib/libmpi.so.0
  #2  0x7fa520fe59b7 in ompi_mpi_errors_are_fatal_comm_handler () from /usr/lib/libmpi.so.0
  #3  0x7fa52100acb2 in PMPI_Barrier () from /usr/lib/libmpi.so.0
  #4  0x7fa52106638a in mca_io_romio_dist_MPI_File_close () from /usr/lib/libmpi.so.0
  #5  0x7fa520feaa2e in file_destructor () from /usr/lib/libmpi.so.0
  #6  0x7fa520fea7c1 in ompi_file_finalize () from /usr/lib/libmpi.so.0
  #7  0x7fa520ff7496 in ompi_mpi_finalize () from /usr/lib/libmpi.so.0
  #8  0x7fa5233bc2d1 in PetscFinalize () at pinit.c:897
  #9  0x00402091 in main (argc=1, args=0x7fff70f1f498) at ex5.c:72

Open MPI 1.3.3, GCC-4.4.0
Linux brakk 2.6.30-ARCH #1 SMP PREEMPT Fri Jun 19 20:44:03 UTC 2009 x86_64 Intel(R) Core(TM)2 Duo CPU T9300 @ 2.50GHz GenuineIntel GNU/Linux

Jed
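P.S. The application-side workaround is just to close every MPI_File before finalizing; a sketch:

  #include <mpi.h>

  int main(int argc, char **argv)
  {
    MPI_File fh;
    MPI_Init(&argc, &argv);
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* ... writes ... */
    MPI_File_close(&fh);  /* must precede MPI_Finalize, or the destructor's
                             internal MPI_Barrier fires after finalization,
                             as in the trace above */
    MPI_Finalize();
    return 0;
  }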
Re: [OMPI users] Bogus memcpy or bogus valgrind record
Jeff Squyres wrote:
> But I'm able to replicate your error (but shouldn't the 2nd buffer be
> the 1st + size (not 2)?) -- let me dig into it a bit... we definitely
> shouldn't be getting invalid writes in the convertor, etc.

As Eugene pointed out earlier, it is fine.

  dataloctab = malloc (2 * (procglbnbr + 1) * sizeof (int));
  dataglbtab = dataloctab + 2;

dataloctab is the 2-element send buffer, dataglbtab is the receive buffer of length 2*procglbnbr.

Jed
Re: [OMPI users] Open MPI programs with autoconf/automake?
On Mon 2008-11-10 12:35, Raymond Wan wrote:
> One thing I was wondering about was whether it is possible, through the
> use of #define's, to create code that is both multi-processor
> (MPI/mpic++) and single-processor (normal g++).  That is, if users do
> not have any MPI installed, it compiles it with g++.
>
> With #define's and compiler flags, I think that can be easily done --
> was wondering if this is something that developers using MPI do and
> whether AC/AM supports it.

The normal way to do this is by building against a serial implementation of MPI.  Lots of parallel numerical libraries bundle such an implementation, so you could just grab one of those.  For example, see PETSc's mpiuni ($PETSC_DIR/include/mpiuni/mpi.h and $PETSC_DIR/src/sys/mpiuni/mpi.c), which implements many MPI calls as macros.  Note that your serial implementation only needs to provide the subset of MPI that your program actually uses.  For instance, if you never send messages to yourself, you can implement MPI_Send as MPI_Abort, since it should never be called in serial.

Jed
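P.S. To give the flavor of such a shim (a sketch, far less complete than mpiuni, assuming the application only uses these few calls):

  /* serial "MPI": enough for an application that only queries size/rank */
  #ifndef HAVE_MPI
  #include <stdlib.h>
  typedef int MPI_Comm;
  #define MPI_COMM_WORLD 0
  #define MPI_Init(argc, argv)      0
  #define MPI_Finalize()            0
  #define MPI_Comm_size(comm, size) (*(size) = 1, 0)
  #define MPI_Comm_rank(comm, rank) (*(rank) = 0, 0)
  /* point-to-point should never run in serial; make misuse loud */
  #define MPI_Send(...)             (abort(), 0)
  #else
  #include <mpi.h>
  #endif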
Re: [OMPI users] OpenMPI runtime-specific environment variable?
On Wed 2008-10-22 00:40, Reuti wrote:
> Okay, now I see.  Why not just call MPI_Comm_size(MPI_COMM_WORLD,
> &nprocs)?  When nprocs is 1, it's a serial run.  It can also be executed
> when not running within mpirun AFAICS.

This is absolutely NOT okay.  You cannot call any MPI functions before MPI_Init (and at least OMPI 1.2+ and MPICH2 1.1a will throw an error if you try).

I'm slightly confused about the original problem.  Is the program linked against an MPI when running in serial?  You have to recompile anyway if you change MPI implementation, so if it's not linked against a real MPI then you know at compile time.  But what is the problem with calling MPI_Init for a serial job?  All implementations I've used allow you to call MPI_Init when the program is run as ./foo (no mpirun).

Jed
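P.S. Concretely, the pattern I have in mind (a sketch):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
    int nprocs;
    MPI_Init(&argc, &argv);                 /* fine under plain ./foo too */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs == 1) {
      /* serial path */
    } else {
      /* parallel path */
    }
    MPI_Finalize();
    return 0;
  }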
Re: [OMPI users] on SEEK_*
On Thu 2008-10-16 08:21, Jeff Squyres wrote:
> FWIW: https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/20 is a
> placemarker for discussion for the upcoming MPI Forum meeting (next
> week).
>
> Also, be aware that OMPI's 1.2.7 solution isn't perfect, either.  You
> can see from ticket 20 that it actually causes a problem if you try to
> use SEEK_SET in a switch/case statement.  But we did this a little
> better in the trunk/v1.3 (see
> https://svn.open-mpi.org/trac/ompi/changeset/19494); this solution *does*
> allow for SEEK_SET to be used in a case statement, but it does always
> bring in <stdio.h> (probably not a huge deal).

I see.

> The real solution is that we're likely going to change these names to
> something else in the MPI spec itself.  And/or drop the C++ bindings
> altogether (see http://lists.mpi-forum.org/mpi-22/2008/10/0177.php).

Radical.  I don't use the C++ bindings anyway.  I especially like proposal (4), Data in User-Defined Callbacks.

On a related note, it would be nice to be able to call an MPI_Op from user code.  For instance, I have an irregular Reduce-like operation where each proc needs to reduce data from a few other procs (much fewer than the entire communicator).  I implement this using a few nonblocking point-to-point calls followed by a local reduction.  I would like my special reduction to accept an arbitrary MPI_Op, but I currently use a function pointer.  Having a public version of ompi_op_reduce would make this much cleaner.

> Additionally -- I should have pointed this out in my first mail -- you
> can also just use MPI_SEEK_SET (and friends).  The spec defines that
> these constants must have the same values as their MPI::SEEK_*
> counterparts.

Right, MPI::SEEK_* is never used.  Thanks Jeff.

Jed
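P.S. For what it's worth, MPI-2.2 later standardized MPI_Reduce_local, which provides exactly this public hook; a sketch assuming an MPI-2.2 implementation:

  #include <mpi.h>

  /* Fold one received contribution into the accumulator with an
     arbitrary MPI_Op: accum[i] = recvbuf[i] op accum[i]. */
  void accumulate(double *recvbuf, double *accum, int n, MPI_Op op)
  {
    MPI_Reduce_local(recvbuf, accum, n, MPI_DOUBLE, op);
  }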
Re: [OMPI users] on SEEK_*
On Thu 2008-10-16 07:43, Jeff Squyres wrote:
> On Oct 16, 2008, at 6:29 AM, Jed Brown wrote:
>
> Open MPI doesn't require undef'ing of anything.  It should also not
> require any special ordering of include files.  Specifically, the
> following codes both compile fine for me with 1.2.8 and the OMPI SVN
> trunk (which is what I assume you mean by "-dev"?):

That's what I meant.  This works with 1.2.7 but not with -dev:

  #include <iostream>
  #undef SEEK_SET
  #undef SEEK_CUR
  #undef SEEK_END
  #include <mpi.h>

If iostream is replaced by stdio, then both fail.

> This is actually a problem in the MPI-2 spec; the names "MPI::SEEK_SET"
> (and friends) were unfortunately chosen poorly.  Hopefully that'll be
> fixed relatively soon, in MPI-2.2.

It wasn't addressed in the MPI-2.1 spec I was reading, hence my confusion.  Namespaces and macros don't play well together.

> MPICH chose to handle this situation a different way than we did, and
> apparently requires that you either #undef something or you #define an
> MPICH-specific macro.  I guess the portable way might be to just always
> define that MPICH-specific macro.  It should be harmless for OMPI.

I'll go with this, thanks.

> FWIW, I was chatting with the MPICH developers at the recent MPI Forum
> meeting and showed them how we did our SEEK_* solution in Open MPI.

Certainly the OMPI solution is better for users.

Jed
[OMPI users] on SEEK_*
I've just run into this chunk of code.

  /* MPICH2 will fail if SEEK_* macros are defined
   * because they are also C++ enums.  Undefine them
   * when including mpi.h and then redefine them
   * for sanity. */
  # ifdef SEEK_SET
  #  define MB_SEEK_SET SEEK_SET
  #  define MB_SEEK_CUR SEEK_CUR
  #  define MB_SEEK_END SEEK_END
  #  undef SEEK_SET
  #  undef SEEK_CUR
  #  undef SEEK_END
  # endif
  #include "mpi.h"
  # ifdef MB_SEEK_SET
  #  define SEEK_SET MB_SEEK_SET
  #  define SEEK_CUR MB_SEEK_CUR
  #  define SEEK_END MB_SEEK_END
  #  undef MB_SEEK_SET
  #  undef MB_SEEK_CUR
  #  undef MB_SEEK_END
  # endif

MPICH2 (1.1.0a1) gives these errors if SEEK_* are present:

  /opt/mpich2/include/mpicxx.h:26:2: error: #error "SEEK_SET is #defined but must not be for the C++ binding of MPI"
  /opt/mpich2/include/mpicxx.h:30:2: error: #error "SEEK_CUR is #defined but must not be for the C++ binding of MPI"
  /opt/mpich2/include/mpicxx.h:35:2: error: #error "SEEK_END is #defined but must not be for the C++ binding of MPI"

but when SEEK_* are not present and iostream has been included, OMPI-dev gives these errors.

  /home/ompi/include/openmpi/ompi/mpi/cxx/mpicxx.h:53: error: ‘SEEK_SET’ was not declared in this scope
  /home/ompi/include/openmpi/ompi/mpi/cxx/mpicxx.h:54: error: ‘SEEK_CUR’ was not declared in this scope
  /home/ompi/include/openmpi/ompi/mpi/cxx/mpicxx.h:55: error: ‘SEEK_END’ was not declared in this scope

There is a subtle difference between OMPI 1.2.7 and -dev, at least with GCC 4.3.2.  If iostream is included before mpi.h and then SEEK_* are #undef'd, then 1.2.7 succeeds while -dev fails with the message above.  If stdio.h is included and SEEK_* are #undef'd, then both OMPI versions fail.  MPICH2 requires in both cases that SEEK_* be #undef'd.

What do you recommend to remain portable?  Is this really an MPICH2 issue?  The standard doesn't seem to address it.  The MPICH2 FAQ has this:

  http://www.mcs.anl.gov/research/projects/mpich2/support/index.php?s=faqs#cxxseek

Jed
Re: [OMPI users] compilation error about Open Macro when building the code with OpenMPI on Mac OS 10.5.5
On Wed, Oct 8, 2008 at 21:19, Sudhakar Mahalingam wrote:
> I am having a problem about "Open" Macro's number of arguments, when I try
> to build a C++ code with the openmpi-1.2.7 on my Mac OS 10.5.5 machine.  The
> error message is given below.  When I look at the file.h and file_inln.h
> header files in the cxx folder, I am seeing that the "Open" function indeed
> takes four arguments, but I don't know why there is this error about the
> number of arguments of 4.  Has anyone else seen this type of error before?

MPI::File::Open is an inline function, not a macro.  You must have an unqualified Open macro defined in this compilation unit, maybe in one of the headers that were included in your code before hdf5.h.  Does it work if you include hdf5.h first?

Jed
Re: [OMPI users] Execution in multicore machines
On Mon 2008-09-29 20:30, Leonardo Fialho wrote:
> 1) If I use one node (8 cores) the "user" % is around 100% per core.  The
> execution time is around 430 seconds.
>
> 2) If I use 2 nodes (4 cores in each node) the "user" % is around 95%
> per core and the "sys" % is 5%.  The execution time is around 220 seconds.
>
> 3) If I use 4 nodes (1 cores in each node) the "user" % is around 85%
> per core and the "sys" % is 15%.  The execution time is around 200
> seconds.

Do you mean 2 cores per node (1 core per socket)?

> Well... the questions are:
>
> A) The execution time in case "1" should be smaller (only sm
> communication, no?) than case "2" and "3", no?  Cache problems?

Is this benchmark memory-bandwidth limited?  Your results are fairly typical for sparse matrix kernels.  One core can more or less saturate the bus on its own, two cores can overlap memory access so it doesn't hurt too much, but with more than two they are all waiting on memory.  The extra cores are cheaper than more sockets, but they don't do much (if any) good for many workloads.

> B) Why the "sys" time while using communication inter nodes?  NIC driver?
> Why does this time increase when I balance the load across the nodes?

Messages over Ethernet cost more than messages in shared memory.  When you only use 1 core per socket, the application is faster because the single thread has the full memory bandwidth to itself; however, MPI needs to move more data over the wire, so that phase costs more.  If your network were faster (e.g. InfiniBand), you could expect the communication to stay quite cheap even with only one process per node.

Jed
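P.S. A rough illustration of the saturation effect (not a real benchmark; the GB/s accounting is approximate and the names are mine):

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define N (1<<24)  /* 128 MB per array per process */

  int main(int argc, char **argv)
  {
    int rank, size, i;
    double *a = malloc(N * sizeof(double)), *b = malloc(N * sizeof(double));
    double t;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (i = 0; i < N; i++) { a[i] = 0.0; b[i] = i; }  /* fault pages first */
    MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime();
    for (i = 0; i < N; i++) a[i] = 2.0 * b[i];  /* bandwidth-bound loop */
    MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime() - t;
    /* one read stream plus one write stream of N doubles per process */
    if (rank == 0)
      printf("%d procs: aggregate ~%.1f GB/s\n",
             size, size * 2.0 * N * sizeof(double) / t / 1e9);
    MPI_Finalize();
    free(a); free(b);
    return 0;
  }

Run it with 1, 2, 4, 8 processes per node: the per-node aggregate rate typically stops improving after two or so processes, which matches the timings above.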
Re: [OMPI users] where is mpif.h ?
On Tue 2008-09-23 08:50, Simon Hammond wrote:
> Yes, it should be there.

Shouldn't the path be automatically included by the mpif77 wrapper?  I ran into this problem when building BLACS (my default OpenMPI 1.2.7 lives in /usr, MPICH2 is at /opt/mpich2).  The build tries

  $ /usr/bin/mpif90 -c -I. -fPIC -Wno-unused-variable -g bi_f77_mpi_attr_get.f
  Error: Can't open included file 'mpif.h'

but this succeeds

  $ /usr/bin/mpif90 -c -I. -I/usr/include -fPIC -Wno-unused-variable -g bi_f77_mpi_attr_get.f

and this works fine as well

  $ /opt/mpich2/mpif90 -c -I. -fPIC -Wno-unused-variable -g bi_f77_mpi_attr_get.f

Is this the expected behavior?

Jed