Re: [OMPI devel] trunk borked -- my fault
On Aug 4, 2009, at 5:50 PM, Jeff Squyres (jsquyres) wrote: > Ah -- I see an AC 2.63b release note: ** AC_REQUIRE now detects the case of an outer macro which first expands then later indirectly requires the same inner macro. Previously, [...] Yes, this is exactly what was happening. The AC_REQUIRE calls that I added force the tests to run above their respective stdout section headers, which is a little bit of a bummer. I'll fix that. -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] trunk borked -- my fault
Ah -- I see an AC 2.63b release note: ** AC_REQUIRE now detects the case of an outer macro which first expands then later indirectly requires the same inner macro. Previously, this case led to silent out-of-order expansion (bug present since 2.50); it now issues a syntax warning, and duplicates the expansion of the inner macro to guarantee dependencies have been met. See the manual for advice on how to refactor macros in order to avoid the bug in earlier autoconf versions and avoid increased script size in the current version. This looks related to what I am seeing. /me goes to investigate... On Aug 4, 2009, at 5:47 PM, George Bosilca wrote: Indeed, r21759 solves the problem. ompi compile successfully on Mac OS X with autoconf 2.64. Thanks, george. On Aug 4, 2009, at 17:41 , Jeff Squyres wrote: > On Aug 4, 2009, at 5:37 PM, George Bosilca wrote: > >> I used 2.64 for about a week on a bunch of machines. I never had >> problems with it before... >> >> After checking it turned out that autoconf 2.64 was freshly installed >> on my Mac, so this might be a problem with autoconf 2.64 and MAC OS >> X ... I'll go back to 2.63 until we figure out a way to solve these >> problems. >> > > FWIW, I saw the warnings on Linux as well, and then configure failed > later in spectacular and interesting ways (I didn't let it get to > the build because configure was so borked up -- all the individual > POSIX .h file tests said that the file was present but could not be > compiled because somehow it was stuck trying to compile them with > gfortran (!) instead of gcc). Something changed in AC2.64 with > regards to how they do language REQUIRE'ing, etc. that I don't fully > understand. > > Let me know if the workaround in r21759 works for you. 
> -- > Jeff Squyres > jsquy...@cisco.com ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] trunk borked -- my fault
Indeed, r21759 solves the problem. ompi compiles successfully on Mac OS X with autoconf 2.64. Thanks, george. On Aug 4, 2009, at 17:41 , Jeff Squyres wrote: On Aug 4, 2009, at 5:37 PM, George Bosilca wrote: I used 2.64 for about a week on a bunch of machines. I never had problems with it before... After checking it turned out that autoconf 2.64 was freshly installed on my Mac, so this might be a problem with autoconf 2.64 and Mac OS X ... I'll go back to 2.63 until we figure out a way to solve these problems. FWIW, I saw the warnings on Linux as well, and then configure failed later in spectacular and interesting ways (I didn't let it get to the build because configure was so borked up -- all the individual POSIX .h file tests said that the file was present but could not be compiled because somehow it was stuck trying to compile them with gfortran (!) instead of gcc). Something changed in AC2.64 with regards to how they do language REQUIRE'ing, etc. that I don't fully understand. Let me know if the workaround in r21759 works for you. -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] trunk borked -- my fault
On Aug 4, 2009, at 5:37 PM, George Bosilca wrote: I used 2.64 for about a week on a bunch of machines. I never had problems with it before... After checking it turned out that autoconf 2.64 was freshly installed on my Mac, so this might be a problem with autoconf 2.64 and MAC OS X ... I'll go back to 2.63 until we figure out a way to solve these problems. FWIW, I saw the warnings on Linux as well, and then configure failed later in spectacular and interesting ways (I didn't let it get to the build because configure was so borked up -- all the individual POSIX .h file tests said that the file was present but could not be compiled because somehow it was stuck trying to compile them with gfortran (!) instead of gcc). Something changed in AC2.64 with regards to how they do language REQUIRE'ing, etc. that I don't fully understand. Let me know if the workaround in r21759 works for you. -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] trunk borked -- my fault
https://svn.open-mpi.org/trac/ompi/changeset/21759 seems to make us play well with AC 2.64. To be honest, I'm not sure why this change works, but it does. I'm going to ping Ralf W. and see if he's got any insight here... On Aug 4, 2009, at 5:17 PM, Jeff Squyres (jsquyres) wrote: Checking this further, my C++ changes were r21755. Updating my SVN tree to the commit before that (r21754), I see that AC 2.64 on this tree issues these same warnings, but then configure works and the build seems to proceed as normal. Did you try AC 2.64 before today? If not, I'd advise backing off AC 2.64 and moving back down to AC 2.63 until we can figure those warnings out. They *seem* to be harmless, but I'm not entirely sure. It looks like some things changed 2.63->2.64 with regards to how languages are selected / used within AC 2.64 that break some of the things I did today (perhaps AC 2.64 just got more strict...?). On Aug 4, 2009, at 4:43 PM, Jeff Squyres (jsquyres) wrote: > Doh. I tested with 2.63. I'll check out 2.64 right now... > > > On Aug 4, 2009, at 4:37 PM, George Bosilca wrote: > > > Not completely fixed. With the latest version of autoconf (2.64) I > get > > a bunch of warnings. > > > > configure.ac:449: warning: AC_REQUIRE: `AC_PROG_CXX' was expanded > > before it was required > > ../../lib/autoconf/c.m4:671: AC_LANG_COMPILER(C++) is expanded > from... > > ../../lib/autoconf/lang.m4:315: AC_LANG_COMPILER_REQUIRE is expanded > > from... > > ../../lib/autoconf/general.m4:2735: AC_RUN_IFELSE is expanded > from... > > ../../lib/m4sugar/m4sh.m4:620: AS_IF is expanded from... > > ../../lib/autoconf/general.m4:2018: AC_CACHE_VAL is expanded from... > > ../../lib/autoconf/general.m4:2039: AC_CACHE_CHECK is expanded > from... > > config/ompi_check_compiler_works.m4:28: OMPI_CHECK_COMPILER_WORKS is > > expanded from... > > config/ompi_setup_cxx.m4:48: _OMPI_SETUP_CXX_COMPILER is expanded > > from... > > config/ompi_setup_cxx.m4:28: OMPI_SETUP_CXX is expanded from... 
> > configure.ac:449: the top level > > configure.ac:488: warning: AC_REQUIRE: `AC_PROG_F77' was expanded > > before it was required > > ../../lib/autoconf/fortran.m4:272: AC_LANG_COMPILER(Fortran 77) is > > expanded from... > > config/ompi_setup_f77.m4:35: OMPI_SETUP_F77 is expanded from... > > configure.ac:488: the top level > > configure.ac:603: warning: AC_REQUIRE: `AC_PROG_FC' was expanded > > before it was required > > ../../lib/autoconf/fortran.m4:279: AC_LANG_COMPILER(Fortran) is > > expanded from... > > config/ompi_setup_f90.m4:37: OMPI_SETUP_F90 is expanded from... > > configure.ac:603: the top level > > > >george. > > > > > > On Aug 4, 2009, at 14:49 , Jeff Squyres wrote: > > > > > Should be fixed in https://svn.open-mpi.org/trac/ompi/changeset/21758. Sorry for the interruption... > > > > > > > > > On Aug 4, 2009, at 10:24 AM, Jeff Squyres wrote: > > > > > >> Doh! > > >> > > >> I committed the "we don't need no stinkin' C++ compiler" changes > > >> this morning after a bunch of testing, but I totally neglected to > > >> test the case *with* a C++ compiler. :-( > > >> > > >> So the trunk is borked at the moment; I'm working on a fix... > > >> > > >> -- > > >> Jeff Squyres > > >> jsquy...@cisco.com -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] trunk borked -- my fault
I used 2.64 for about a week on a bunch of machines. I never had problems with it before... After checking it turned out that autoconf 2.64 was freshly installed on my Mac, so this might be a problem with autoconf 2.64 and MAC OS X ... I'll go back to 2.63 until we figure out a way to solve these problems. george. On Aug 4, 2009, at 17:17 , Jeff Squyres wrote: Checking this further, my C++ changes were r21755. Updating my SVN tree to the commit before that (r21754), I see that AC 2.64 on this tree issues these same warnings, but then configure works and the build seems to proceed as normal. Did you try AC 2.64 before today? If not, I'd advise backing off AC 2.64 and moving back down to AC 2.63 until we can figure those warnings out. They *seem* to be harmless, but I'm not entirely sure. It looks like some things changed 2.63->2.64 with regards to how languages are selected / used within AC 2.64 that break some of the things I did today (perhaps AC 2.64 just got more strict...?). On Aug 4, 2009, at 4:43 PM, Jeff Squyres (jsquyres) wrote: Doh. I tested with 2.63. I'll check out 2.64 right now... On Aug 4, 2009, at 4:37 PM, George Bosilca wrote: > Not completely fixed. With the latest version of autoconf (2.64) I get > a bunch of warnings. > > configure.ac:449: warning: AC_REQUIRE: `AC_PROG_CXX' was expanded > before it was required > ../../lib/autoconf/c.m4:671: AC_LANG_COMPILER(C++) is expanded from... > ../../lib/autoconf/lang.m4:315: AC_LANG_COMPILER_REQUIRE is expanded > from... > ../../lib/autoconf/general.m4:2735: AC_RUN_IFELSE is expanded from... > ../../lib/m4sugar/m4sh.m4:620: AS_IF is expanded from... > ../../lib/autoconf/general.m4:2018: AC_CACHE_VAL is expanded from... > ../../lib/autoconf/general.m4:2039: AC_CACHE_CHECK is expanded from... > config/ompi_check_compiler_works.m4:28: OMPI_CHECK_COMPILER_WORKS is > expanded from... > config/ompi_setup_cxx.m4:48: _OMPI_SETUP_CXX_COMPILER is expanded > from... 
> config/ompi_setup_cxx.m4:28: OMPI_SETUP_CXX is expanded from... > configure.ac:449: the top level > configure.ac:488: warning: AC_REQUIRE: `AC_PROG_F77' was expanded > before it was required > ../../lib/autoconf/fortran.m4:272: AC_LANG_COMPILER(Fortran 77) is > expanded from... > config/ompi_setup_f77.m4:35: OMPI_SETUP_F77 is expanded from... > configure.ac:488: the top level > configure.ac:603: warning: AC_REQUIRE: `AC_PROG_FC' was expanded > before it was required > ../../lib/autoconf/fortran.m4:279: AC_LANG_COMPILER(Fortran) is > expanded from... > config/ompi_setup_f90.m4:37: OMPI_SETUP_F90 is expanded from... > configure.ac:603: the top level > >george. > > > On Aug 4, 2009, at 14:49 , Jeff Squyres wrote: > > > Should be fixed in https://svn.open-mpi.org/trac/ompi/changeset/21758. Sorry for the interruption... > > > > > > On Aug 4, 2009, at 10:24 AM, Jeff Squyres wrote: > > > >> Doh! > >> > >> I committed the "we don't need no stinkin' C++ compiler" changes > >> this morning after a bunch of testing, but I totally neglected to > >> test the case *with* a C++ compiler. :-( > >> > >> So the trunk is borked at the moment; I'm working on a fix... > >> > >> -- > >> Jeff Squyres > >> jsquy...@cisco.com
Re: [OMPI devel] trunk borked -- my fault
Checking this further, my C++ changes were r21755. Updating my SVN tree to the commit before that (r21754), I see that AC 2.64 on this tree issues these same warnings, but then configure works and the build seems to proceed as normal. Did you try AC 2.64 before today? If not, I'd advise backing off AC 2.64 and moving back down to AC 2.63 until we can figure those warnings out. They *seem* to be harmless, but I'm not entirely sure. It looks like some things changed 2.63->2.64 with regards to how languages are selected / used within AC 2.64 that break some of the things I did today (perhaps AC 2.64 just got more strict...?). On Aug 4, 2009, at 4:43 PM, Jeff Squyres (jsquyres) wrote: Doh. I tested with 2.63. I'll check out 2.64 right now... On Aug 4, 2009, at 4:37 PM, George Bosilca wrote: > Not completely fixed. With the latest version of autoconf (2.64) I get > a bunch of warnings. > > configure.ac:449: warning: AC_REQUIRE: `AC_PROG_CXX' was expanded > before it was required > ../../lib/autoconf/c.m4:671: AC_LANG_COMPILER(C++) is expanded from... > ../../lib/autoconf/lang.m4:315: AC_LANG_COMPILER_REQUIRE is expanded > from... > ../../lib/autoconf/general.m4:2735: AC_RUN_IFELSE is expanded from... > ../../lib/m4sugar/m4sh.m4:620: AS_IF is expanded from... > ../../lib/autoconf/general.m4:2018: AC_CACHE_VAL is expanded from... > ../../lib/autoconf/general.m4:2039: AC_CACHE_CHECK is expanded from... > config/ompi_check_compiler_works.m4:28: OMPI_CHECK_COMPILER_WORKS is > expanded from... > config/ompi_setup_cxx.m4:48: _OMPI_SETUP_CXX_COMPILER is expanded > from... > config/ompi_setup_cxx.m4:28: OMPI_SETUP_CXX is expanded from... > configure.ac:449: the top level > configure.ac:488: warning: AC_REQUIRE: `AC_PROG_F77' was expanded > before it was required > ../../lib/autoconf/fortran.m4:272: AC_LANG_COMPILER(Fortran 77) is > expanded from... > config/ompi_setup_f77.m4:35: OMPI_SETUP_F77 is expanded from... 
> configure.ac:488: the top level > configure.ac:603: warning: AC_REQUIRE: `AC_PROG_FC' was expanded > before it was required > ../../lib/autoconf/fortran.m4:279: AC_LANG_COMPILER(Fortran) is > expanded from... > config/ompi_setup_f90.m4:37: OMPI_SETUP_F90 is expanded from... > configure.ac:603: the top level > >george. > > > On Aug 4, 2009, at 14:49 , Jeff Squyres wrote: > > > Should be fixed in https://svn.open-mpi.org/trac/ompi/changeset/21758. Sorry for the interruption... > > > > > > On Aug 4, 2009, at 10:24 AM, Jeff Squyres wrote: > > > >> Doh! > >> > >> I committed the "we don't need no stinkin' C++ compiler" changes > >> this morning after a bunch of testing, but I totally neglected to > >> test the case *with* a C++ compiler. :-( > >> > >> So the trunk is borked at the moment; I'm working on a fix... > >> > >> -- > >> Jeff Squyres > >> jsquy...@cisco.com -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] trunk borked -- my fault
Doh. I tested with 2.63. I'll check out 2.64 right now... On Aug 4, 2009, at 4:37 PM, George Bosilca wrote: Not completely fixed. With the latest version of autoconf (2.64) I get a bunch of warnings. configure.ac:449: warning: AC_REQUIRE: `AC_PROG_CXX' was expanded before it was required ../../lib/autoconf/c.m4:671: AC_LANG_COMPILER(C++) is expanded from... ../../lib/autoconf/lang.m4:315: AC_LANG_COMPILER_REQUIRE is expanded from... ../../lib/autoconf/general.m4:2735: AC_RUN_IFELSE is expanded from... ../../lib/m4sugar/m4sh.m4:620: AS_IF is expanded from... ../../lib/autoconf/general.m4:2018: AC_CACHE_VAL is expanded from... ../../lib/autoconf/general.m4:2039: AC_CACHE_CHECK is expanded from... config/ompi_check_compiler_works.m4:28: OMPI_CHECK_COMPILER_WORKS is expanded from... config/ompi_setup_cxx.m4:48: _OMPI_SETUP_CXX_COMPILER is expanded from... config/ompi_setup_cxx.m4:28: OMPI_SETUP_CXX is expanded from... configure.ac:449: the top level configure.ac:488: warning: AC_REQUIRE: `AC_PROG_F77' was expanded before it was required ../../lib/autoconf/fortran.m4:272: AC_LANG_COMPILER(Fortran 77) is expanded from... config/ompi_setup_f77.m4:35: OMPI_SETUP_F77 is expanded from... configure.ac:488: the top level configure.ac:603: warning: AC_REQUIRE: `AC_PROG_FC' was expanded before it was required ../../lib/autoconf/fortran.m4:279: AC_LANG_COMPILER(Fortran) is expanded from... config/ompi_setup_f90.m4:37: OMPI_SETUP_F90 is expanded from... configure.ac:603: the top level george. On Aug 4, 2009, at 14:49 , Jeff Squyres wrote: > Should be fixed in https://svn.open-mpi.org/trac/ompi/changeset/21758. Sorry for the interruption... > > > On Aug 4, 2009, at 10:24 AM, Jeff Squyres wrote: > >> Doh! >> >> I committed the "we don't need no stinkin' C++ compiler" changes >> this morning after a bunch of testing, but I totally neglected to >> test the case *with* a C++ compiler. :-( >> >> So the trunk is borked at the moment; I'm working on a fix... >> >> -- >> Jeff Squyres >> jsquy...@cisco.com -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] trunk borked -- my fault
Not completely fixed. With the latest version of autoconf (2.64) I get a bunch of warnings. configure.ac:449: warning: AC_REQUIRE: `AC_PROG_CXX' was expanded before it was required ../../lib/autoconf/c.m4:671: AC_LANG_COMPILER(C++) is expanded from... ../../lib/autoconf/lang.m4:315: AC_LANG_COMPILER_REQUIRE is expanded from... ../../lib/autoconf/general.m4:2735: AC_RUN_IFELSE is expanded from... ../../lib/m4sugar/m4sh.m4:620: AS_IF is expanded from... ../../lib/autoconf/general.m4:2018: AC_CACHE_VAL is expanded from... ../../lib/autoconf/general.m4:2039: AC_CACHE_CHECK is expanded from... config/ompi_check_compiler_works.m4:28: OMPI_CHECK_COMPILER_WORKS is expanded from... config/ompi_setup_cxx.m4:48: _OMPI_SETUP_CXX_COMPILER is expanded from... config/ompi_setup_cxx.m4:28: OMPI_SETUP_CXX is expanded from... configure.ac:449: the top level configure.ac:488: warning: AC_REQUIRE: `AC_PROG_F77' was expanded before it was required ../../lib/autoconf/fortran.m4:272: AC_LANG_COMPILER(Fortran 77) is expanded from... config/ompi_setup_f77.m4:35: OMPI_SETUP_F77 is expanded from... configure.ac:488: the top level configure.ac:603: warning: AC_REQUIRE: `AC_PROG_FC' was expanded before it was required ../../lib/autoconf/fortran.m4:279: AC_LANG_COMPILER(Fortran) is expanded from... config/ompi_setup_f90.m4:37: OMPI_SETUP_F90 is expanded from... configure.ac:603: the top level george. On Aug 4, 2009, at 14:49 , Jeff Squyres wrote: Should be fixed in https://svn.open-mpi.org/trac/ompi/changeset/21758. Sorry for the interruption... On Aug 4, 2009, at 10:24 AM, Jeff Squyres wrote: Doh! I committed the "we don't need no stinkin' C++ compiler" changes this morning after a bunch of testing, but I totally neglected to test the case *with* a C++ compiler. :-( So the trunk is borked at the moment; I'm working on a fix... -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] trunk borked -- my fault
Should be fixed in https://svn.open-mpi.org/trac/ompi/changeset/21758. Sorry for the interruption... On Aug 4, 2009, at 10:24 AM, Jeff Squyres wrote: Doh! I committed the "we don't need no stinkin' C++ compiler" changes this morning after a bunch of testing, but I totally neglected to test the case *with* a C++ compiler. :-( So the trunk is borked at the moment; I'm working on a fix... -- Jeff Squyres jsquy...@cisco.com
[OMPI devel] trunk borked -- my fault
Doh! I committed the "we don't need no stinkin' C++ compiler" changes this morning after a bunch of testing, but I totally neglected to test the case *with* a C++ compiler. :-( So the trunk is borked at the moment; I'm working on a fix... -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] Device failover on ob1
From my perspective, the assumption that the low-level is reliable is completely consistent with the assumptions that went into the ob1 design, so I don't see changes you may propose as a problem in principle. Thanks a lot for the clarification, Rich On 8/3/09 9:39 AM, "Mouhamed Gueye" wrote: Hi list, I'll try to answer the main concerns so far. We chose to work on ob1 mainly for two reasons: - We focused first on fixing dr but were quite disappointed by its performance in comparison with ob1. We then moved our work to ob1 to provide failover while keeping good performance. - Secondly, we wanted to avoid forking ob1 as much as possible, so as to stay up-to-date with the code base. Plus, the failover layer is so thin (in comparison with the code base) that it would not make sense to fork the base into a new pml. But we were aware that ob1 won't allow any change with non-zero impact, and that is why the added code is configured out by default. Actually, we wanted to address long jobs that can afford a very small performance loss but cannot afford to abort after several hours or days of computation because of one port failure. The goal of this prototype is to provide a proof of concept for discussion, as we know there are other people working on this subject. As stated in the previous mail, the idea is to store any sent btl descriptor until it is marked as delivered. For that, we rely on completion callbacks, and the assumption, clearly, is that a called completion function means the message was delivered to the remote card. The underlying btl is the one that ensures message delivery. This is currently the case for the openib btl, but any other btl may be able to do so. So, with that assumption, we do not need any pml-level acknowledgment protocol (no extra messages). No timer is needed for retransmission, as it is triggered by btl failure. Today, only the error callback scenario is implemented. We should also handle btl send method return codes.
To deal with message duplication, the protocol maintains a message id that allows tracking received messages (hence the larger header). So any duplicated message will not be processed. Concerning the openib btl, on a multi-port system, the connection scheme is supposed to be (host 1-port 0) <==> (host 2-port 0) and (host 1-port 1) <==> (host 2-port 1) for example. This is done at btl endpoint initialization, but when the connection is established at the first send attempt, the port association information is not processed. This results in a crossed connection scheme ( (host 1-port 0) <==> (host 2-port 1) and (host 1-port 1) <==> (host 2-port 0)). So, instead of having two separate rings or paths, we have 1 big ring that does not allow failover. We had to fix this to enable failover in both multi-path (same network) and multi-rail (2 separate networks) with openib. Brian, so far, we are only able to switch from one failing btl to a safe one. When there is no btl left, we abort the job. The next step is to be able to re-establish the connection when the network is back. Mouhamed Graham, Richard L. wrote: > What is the impact on sm, which is by far the most sensitive to latency. This > really belongs in a place other than ob1. Ob1 is supposed to provide the > lowest latency possible, and other pml's are supposed to be used for heavier > weight protocols. > > On the technical side, how do you distinguish between a lost acknowledgement > and an undelivered message ? You really don't want to try and deliver data > into user space twice, as once a receive is complete, who knows what the user > has done with that buffer ? A general treatment needs to be able to handle false > negatives, and attempts to deliver the data more than once. > > How are you detecting missing acknowledgements ? Are you using some sort of > timer ? > > Rich > > On 7/31/09 5:49 AM, "Mouhamed Gueye" wrote: > > Hi list, > > Here is an update on our work concerning device failover.
> > As many of you suggested, we reoriented our work on ob1 rather than dr > and we now have a working prototype on top of ob1. The approach is to > store btl descriptors sent to peers and delete them when we receive > proof of delivery. So far, we rely on completion callback functions, > assuming that the message is delivered when the completion function is > called, that is the case of openib. When a btl module fails, it is > removed from the endpoint's btl list and the next one is used to > retransmit stored descriptors. No extra-message is transmitted, it only > consists in additions to the header. It has been mainly tested with two > IB modules, in both multi-rail (two separate networks) and multi-path (a > big unique network). > > You can grab and test the patch here (applies on top of the trunk) : > http://bitbucket.org/gueyem/ob1-failover/ > > To compile with failover support, just define --enable-device-failover > at configure. You can then run a benchmark, disconnect a port and see > the failover operate. > > A little latency
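The scheme discussed in this message (keep each sent descriptor until its completion callback fires, retransmit on the next btl when one fails, and drop duplicates by message id on the receive side) can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class and method names (`Btl`, `Endpoint`, `Receiver`, `on_completion`, `on_error`) are hypothetical and are not taken from the ob1 or openib sources, and real btl behaviour is considerably more involved.

```python
from collections import OrderedDict


class Btl:
    """Hypothetical stand-in for one btl module (for example one IB port)."""

    def __init__(self, name):
        self.name = name
        self.in_flight = []  # (msg_id, payload) pairs handed to this btl

    def send(self, msg_id, payload):
        self.in_flight.append((msg_id, payload))


class Receiver:
    """Receive side: remembers message ids so duplicates are not processed."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def receive(self, msg_id, payload):
        if msg_id in self.seen:
            return False  # duplicate: never deliver into the user buffer twice
        self.seen.add(msg_id)
        self.delivered.append(payload)
        return True


class Endpoint:
    """Send side: stores every descriptor until completion is reported."""

    def __init__(self, btls):
        self.btls = list(btls)
        self.pending = OrderedDict()  # msg_id -> payload, not yet confirmed
        self.next_id = 0

    def send(self, payload):
        msg_id = self.next_id
        self.next_id += 1
        self.pending[msg_id] = payload
        self.btls[0].send(msg_id, payload)
        return msg_id

    def on_completion(self, msg_id):
        # Completion is treated as proof of delivery, so the copy is dropped.
        self.pending.pop(msg_id, None)

    def on_error(self, failed_btl):
        # Remove the failed btl and retransmit whatever is still stored.
        self.btls.remove(failed_btl)
        if not self.btls:
            raise RuntimeError("no btl left: abort the job")
        for msg_id, payload in self.pending.items():
            self.btls[0].send(msg_id, payload)


def demo():
    port0, port1 = Btl("port0"), Btl("port1")
    ep, rx = Endpoint([port0, port1]), Receiver()
    a = ep.send("A")
    ep.send("B")
    # "A" arrives and completes; "B" arrives, but port0 fails before
    # its completion is reported to the sender.
    rx.receive(*port0.in_flight[0])
    ep.on_completion(a)
    rx.receive(*port0.in_flight[1])
    ep.on_error(port0)
    # Only the unconfirmed message is resent on port1, and the receiver
    # recognises it as a duplicate.
    for msg in port1.in_flight:
        rx.receive(*msg)
    return rx.delivered, [msg_id for msg_id, _ in port1.in_flight]


print(demo())
```

In this toy run only the unconfirmed message is retransmitted, and the receiver discards the second copy, which is the reason the message id is carried in the header.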
Re: [OMPI devel] [OT] Who's going to Helsinki?
I'll be there, however for EPVMMPI only, i.e. I arrive on Sunday. Edgar Jeff Squyres wrote: Who's going to Helsinki? Does anyone want to meet up for some sight-seeing and/or have a devel meeting? I know that some of our European developers are not attending, but if we have a day-long devel meeting, perhaps they might be motivated...? -- Edgar Gabriel Assistant Professor Parallel Software Technologies Lab http://pstl.cs.uh.edu Department of Computer Science University of Houston Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
Re: [OMPI devel] [PATCH] Better error reporting when failing to load a component
Absolutely correct -- fixed -- thanks! On Aug 4, 2009, at 4:45 AM, Arthur Huillet wrote: Hi, Jeff Squyres wrote: > Glad it was helpful! Feel free to let us know if there's anything > else that would be helpful there -- it's easy enough to give you write > access to the wiki. Just a small thing on the CreateComponent page : "Create a directory with the component name in /mca/foo/. For the purposes of this document, we'll assume that your framework name is "bar" (i.e., /mca/foo/bar/)." This line looks fishy to me. s/framework/component/ is probably what should be written here. Thanks -- Greetings, A. Huillet -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] [OT] Who's going to Helsinki?
Jeff Squyres wrote: Who's going to Helsinki? Does anyone want to meet up for some sight-seeing and/or have a devel meeting? I know that some of our European developers are not attending, but if we have a day-long devel meeting, perhaps they might be motivated...? I will be attending. --td
Re: [OMPI devel] [OT] Who's going to Helsinki?
On Aug 4, 2009, at 7:34 AM, Sylvain Jeaugey wrote: I bet you're referring to Euro PVM MPI 09 ? Heh -- sorry, I should have been more clear. :-) Yes, I was referring to both Euro PVM/MPI and the Forum meeting on W-F in the previous week. I'm actually *only* attending the Forum meeting (leaving early Saturday morning), but there's wiggle room in there for some OMPI-specific devel time... -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] [OT] Who's going to Helsinki?
Hi Jeff, I bet you're referring to Euro PVM MPI 09 ? If this is what you're referring to, I should attend as usual. And of course, I'm very interested in joining a devel meeting :) Sylvain On Tue, 4 Aug 2009, Jeff Squyres wrote: Who's going to Helsinki? Does anyone want to meet up for some sight-seeing and/or have a devel meeting? I know that some of our European developers are not attending, but if we have a day-long devel meeting, perhaps they might be motivated...? -- Jeff Squyres jsquy...@cisco.com
[OMPI devel] [OT] Who's going to Helsinki?
Who's going to Helsinki? Does anyone want to meet up for some sight-seeing and/or have a devel meeting? I know that some of our European developers are not attending, but if we have a day-long devel meeting, perhaps they might be motivated...? -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] Device failover on ob1
Rolf/Mouhamed Could you get together off-list to discuss the different approaches and see if/where there is common ground. It would be nice to see an integrated solution - personally, I would rather not see two orthogonal approaches unless they can be cleanly separated. Much better if they could support each other in an intelligent fashion. On Aug 3, 2009, at 9:49 AM, Pavel Shamis (Pasha) wrote: I have not, but there should be no difference. The failover code only gets triggered when an error happens. Otherwise, there are no differences in the code paths while everything is functioning normally. Sounds good. I still did not have time to review the code. I will try to do it during this week. Pasha Rolf On 08/03/09 11:14, Pavel Shamis (Pasha) wrote: Rolf, Did you compare latency/bw for failover-enabled code VS trunk ? Pasha. Rolf Vandevaart wrote: Hi folks: As some of you know, I have also been looking into implementing failover as well. I took a different approach as I am solving the problem within the openib BTL itself. This of course means that this only works for failing from one openib BTL to another but that was our area of interest. This also means that we do not need to keep track of fragments as we get them back from the completion queue upon failure. We then extract the relevant information and repost on the other working endpoint. My work has been progressing at http://bitbucket.org/rolfv/ompi-failover . This only currently works for send semantics so you have to run with -mca btl_openib_flags 1. Rolf On 07/31/09 05:49, Mouhamed Gueye wrote: Hi list, Here is an update on our work concerning device failover. As many of you suggested, we reoriented our work on ob1 rather than dr and we now have a working prototype on top of ob1. The approach is to store btl descriptors sent to peers and delete them when we receive proof of delivery. 
So far, we rely on completion callback functions, assuming that the message is delivered when the completion function is called, which is the case for openib. When a btl module fails, it is removed from the endpoint's btl list and the next one is used to retransmit stored descriptors. No extra message is transmitted; the scheme only adds fields to the header. It has been mainly tested with two IB modules, in both multi-rail (two separate networks) and multi-path (a big unique network). You can grab and test the patch here (applies on top of the trunk) : http://bitbucket.org/gueyem/ob1-failover/ To compile with failover support, just define --enable-device-failover at configure. You can then run a benchmark, disconnect a port and see the failover operate. A little latency increase (~ 2%) is induced by the failover layer when no failover occurs. To accelerate the failover process on openib, you can try to lower the btl_openib_ib_timeout openib parameter to 15 for example instead of 20 (default value). Mouhamed
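As a rough guide to why lowering btl_openib_ib_timeout speeds up failover: to my understanding the value is an exponent for the InfiniBand local ACK timeout, with the interval being 4.096 microseconds times 2 to the power of the value. The sketch below just does that arithmetic; the function name is hypothetical, and it ignores the retry count, which multiplies the time before an error is actually reported.

```python
def ib_ack_timeout_seconds(exponent):
    """Timeout interval for a given exponent: 4.096 us * 2**exponent."""
    return 4.096e-6 * (2 ** exponent)


# Compare the default (20) with the lower value suggested in the thread (15).
for value in (20, 15):
    seconds = ib_ack_timeout_seconds(value)
    print(f"btl_openib_ib_timeout={value}: about {seconds:.3f} s per attempt")
```

Under that assumption, going from 20 to 15 shortens each attempt from roughly 4.3 seconds to roughly 0.13 seconds, so a dead port is noticed much sooner.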
Re: [OMPI devel] [PATCH] Better error reporting when failing to load a component
Hi, Jeff Squyres wrote: Glad it was helpful! Feel free to let us know if there's anything else that would be helpful there -- it's easy enough to give you write access to the wiki. Just a small thing on the CreateComponent page : "Create a directory with the component name in /mca/foo/. For the purposes of this document, we'll assume that your framework name is "bar" (i.e., /mca/foo/bar/)." This line looks fishy to me. s/framework/component/ is probably what should be written here. Thanks -- Greetings, A. Huillet