[OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless
Hi, I am trying to run Open MPI 1.3.3 between a Linux box running Ubuntu Server 9.04 and a Macintosh. I have configured Open MPI with the following options:

./configure --prefix=/usr/local/ --enable-heterogeneous --disable-shared --enable-static

When both machines are connected to the network via Ethernet cables, Open MPI works fine. But when I switch the Linux box to a wireless adapter, I can reach (ping) the Macintosh, yet Open MPI hangs on a hello-world program. I ran:

/usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205 /tmp/back

and it hangs on a send/receive between the two ends. All firewalls are turned off at the Macintosh end. Please let me know how to debug this further. The following is the output:

fuji:src pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca btl tcp,self --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205 /tmp/hello
[fuji.local:01316] mca: base: components_open: Looking for btl components
[fuji.local:01316] mca: base: components_open: opening btl components
[fuji.local:01316] mca: base: components_open: found loaded component self
[fuji.local:01316] mca: base: components_open: component self has no register function
[fuji.local:01316] mca: base: components_open: component self open function successful
[fuji.local:01316] mca: base: components_open: found loaded component tcp
[fuji.local:01316] mca: base: components_open: component tcp has no register function
[fuji.local:01316] mca: base: components_open: component tcp open function successful
[fuji.local:01316] select: initializing btl component self
[fuji.local:01316] select: init of component self returned success
[fuji.local:01316] select: initializing btl component tcp
[fuji.local:01316] select: init of component tcp returned success
[apex-backpack:04753] mca: base: components_open: Looking for btl components
[apex-backpack:04753] mca: base: components_open: opening btl components
[apex-backpack:04753] mca: base: components_open: found loaded component self
[apex-backpack:04753] mca: base: components_open: component self has no register function
[apex-backpack:04753] mca: base: components_open: component self open function successful
[apex-backpack:04753] mca: base: components_open: found loaded component tcp
[apex-backpack:04753] mca: base: components_open: component tcp has no register function
[apex-backpack:04753] mca: base: components_open: component tcp open function successful
[apex-backpack:04753] select: initializing btl component self
[apex-backpack:04753] select: init of component self returned success
[apex-backpack:04753] select: initializing btl component tcp
[apex-backpack:04753] select: init of component tcp returned success
Process 0 on fuji.local out of 2
Process 1 on apex-backpack out of 2
[apex-backpack:04753] btl: tcp: attempting to connect() to address 10.11.14.203 on port 9360

regards, pallab
Re: [OMPI devel] MPIR_Breakpoint visibility
Hi Jeff, Sorry about the ambiguity. I just had another conversation with our TotalView person, and the problem -seems- to be unrelated to OMPI. Guess I jumped the gun...

Thanks, Samuel K. Gutierrez

On Sep 21, 2009, at 8:58 AM, Jeff Squyres wrote:

Can you more precisely define "not working properly"?

On Sep 21, 2009, at 10:26 AM, Samuel K. Gutierrez wrote:

Hi, According to our TotalView person, the PGI and Intel versions of OMPI 1.3.3 are not working properly. She noted that 1.2.8 and 1.3.2 work fine. Thanks, Samuel K. Gutierrez

On Sep 21, 2009, at 7:19 AM, Terry Dontje wrote:

> Ralph Castain wrote:
>> I see it declared "extern" in orte/tools/orterun/debuggers.h, but not DECLSPEC'd
>>
>> FWIW: LANL uses Intel compilers + TotalView on a regular basis, and I have yet to hear of an issue.
>>
> It actually will work if you attach to the job or if you are not relying on MPIR_Breakpoint to actually stop execution.
>
> --td
>
>> On Sep 21, 2009, at 7:03 AM, Terry Dontje wrote:
>>
>>> I was kind of amazed no one else managed to run into this, but it was brought to my attention that, compiling OMPI with Intel compilers and visibility enabled, the MPIR_Breakpoint symbol was not being exposed. I am assuming this is due to MPIR_Breakpoint not being ORTE_ or OMPI_DECLSPEC'd. Do others agree, or am I missing something obvious here?
>>>
>>> Interestingly enough, it doesn't look like the gcc, PGI, PathScale or Sun compilers hide the MPIR_Breakpoint symbol.
>>> --td

-- Jeff Squyres jsquy...@cisco.com

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Dynamic languages, dlopen() issues, and symbol visibility of libtool ltdl API in current trunk
As a workaround, Lisandro could just pre-seed the cache variables of the respective configure tests that come out wrong:

./configure lt_cv_dlopen_self=yes lt_cv_dlopen_self_static=yes

HTH. Cheers, Ralf

* Jeff Squyres wrote on Mon, Sep 21, 2009 at 02:45:28PM CEST:
> Ick; I appreciate Lisandro's quandary, but don't quite know what to do.
>
> How about keeping libltdl fvisibility=hidden inside mpi4py?
>
> On Sep 17, 2009, at 11:16 AM, Josh Hursey wrote:
>
> >So I started down this road a couple months ago. I was using lt_dlopen() and friends in the OPAL CRS self module. The visibility changes broke that functionality. The one solution that I started implementing was precisely what you suggested: wrapping a subset of the libtool calls and prefixing them with opal_*. The email thread is below:
> > http://www.open-mpi.org/community/lists/devel/2009/07/6531.php
> >
> >The problem that I hit was that libtool's build system did not play well with the visibility symbols. This caused dlopen to be disabled incorrectly. The libtool folks have a patch and, I believe, they are planning on incorporating it in the next release. The email thread is below:
> > http://thread.gmane.org/gmane.comp.gnu.libtool.patches/9446
> >
> >So we would (others can speak up if not) certainly consider such a wrapper, but I think we need to wait for the next libtool release (unless there is other magic we can do) before it would be usable.
> >
> >Do others have any other ideas on how we might get around this in the mean time?
> >
> >-- Josh
> >
> >On Sep 16, 2009, at 5:59 PM, Lisandro Dalcin wrote:
> >
> >> Hi all. I have to contact you again about the issues related to dlopen()ing libmpi with RTLD_LOCAL, as many dynamic languages (Python in my case) do.
> >>
> >> So far, I've been able to manage the issues (despite the "do nothing" policy from Open MPI devs, which I understand) in a more or less portable manner by taking advantage of the availability of libtool ltdl symbols in the Open MPI libraries (specifically, in libopen-pal). For reference, all this hackery is here:
> >> http://code.google.com/p/mpi4py/source/browse/trunk/src/compat/openmpi.h
> >>
> >> However, I noticed that in the current trunk (v1.4, IIUC) things have changed and the libtool symbols are not externally available. Again, I understand the reason and acknowledge that such a change is a really good thing. However, it has broken all my hackery for dlopen()ing libmpi before the call to MPI_Init().
> >>
> >> Is there any chance that libopen-pal could provide some properly prefixed (let's say, using "opal_" as a prefix) wrapper calls to a small subset of the libtool ltdl API? The following set of wrapper calls would be the minimum required to properly load libmpi in a portable manner and clean up resources (let me abuse my previous suggestion and add the opal_ prefix):
> >>
> >> opal_lt_dlinit()
> >> opal_lt_dlexit()
> >>
> >> opal_lt_dladvise_init(a)
> >> opal_lt_dladvise_destroy(a)
> >> opal_lt_dladvise_global(a)
> >> opal_lt_dladvise_ext(a)
> >>
> >> opal_lt_dlopenadvise(n,a)
> >> opal_lt_dlclose(h)
> >>
> >> Any chance this request could be considered? I would really like to have this before any Open MPI tarball gets released without the libtool symbols exposed...
Re: [hwloc-devel] last API possible changes
On Mon, 21 Sep 2009 at 19:33:11 +0200, Samuel Thibault wrote:
> On Mon, 21 Sep 2009 at 12:31:41 -0400, Jeff Squyres wrote:
> > On Sep 21, 2009, at 10:42 AM, Samuel Thibault wrote:
> > >It's part of the language starting from C99 only. An application could
> > >enable non-C99 mode where it becomes undefined; you can never know.
> >
> > That is a decade old, no? ;-)
>
> Yes, but existing software tends to still not evolve. I'm still seeing
> software using the old termio interface while it dates at least back to
> 1988.

Correction: "it" above should read "termios, the newer one".

Samuel
Re: [hwloc-devel] last API possible changes
On Mon, 21 Sep 2009 at 12:31:41 -0400, Jeff Squyres wrote:
> On Sep 21, 2009, at 10:42 AM, Samuel Thibault wrote:
> >It's part of the language starting from C99 only. An application could
> >enable non-C99 mode where it becomes undefined; you can never know.
>
> That is a decade old, no? ;-)

Yes, but existing software tends to still not evolve. I'm still seeing software using the old termio interface, while it dates at least back to 1988.

> (Sorry -- I wasn't trying to be a jerk; just trying to be thorough...)

No problem. Actually, I had followed your reasoning at the time I wrote that part of the code; I'm just repeating what I thought in my head at the time :)

> Is it possible that our use of restrict in cpuset-bits.h could come to a conclusion that is different than the underlying compiler (e.g., the underlying compiler needs __restrict)?

I'm not sure I understand what you mean. What could happen is that gcc stops understanding __restrict while it has understood it since 2.95; I doubt that will ever happen. Another very low-probability way to get something wrong is if a compiler defines __STDC_VERSION__ to a value greater than 199901L but doesn't accept restrict; that would really be a compiler bug. The last possibility is restrict being #defined to something that is not the standard restrict qualifier. I've just dropped the "#if defined restrict" part to avoid it; that's not a big loss.

> Alternatively, is the restrict optimization really worth it here?

Re-reading what we use it for at the moment, there are not many optimizations to be done, but now that I've removed the only case that could be problematic, it shouldn't be a problem.

Samuel
Re: [hwloc-devel] last API possible changes
On Mon, 21 Sep 2009 at 10:04:21 -0400, Jeff Squyres wrote:
> On Sep 21, 2009, at 9:40 AM, Samuel Thibault wrote:
> >> So it should be ok to use AC_C_RESTRICT then, right?
> >
> >But then we can't expose restrict in installed headers, since we don't
> >know _whether_ and how it is defined.
>
> Understood, but is that really our problem? "restrict" is part of the C language, so portable applications should be able to handle it in headers that they import, right?

It's part of the language starting from C99 only. An application could enable non-C99 mode where it becomes undefined; you can never know.

> Alternatively, this whole block in cpuset-bits.h could be wrapped in an "#ifndef restrict", right?

That will work only if libraries that possibly define restrict and are included after hwloc do the same. You may argue that then it's their problem. The only library I see defining restrict in my /usr/include does it without "#ifndef restrict"; I'm not sure we want to fight this.

Anyway, #defining restrict in an installed header means that we'd have to copy the autoconf stuff into it anyway, as to my knowledge autoconf is not flexible enough to output that only to some config.h.in header. That hence doesn't buy us much compared to the current situation, except that we'd have restrict instead of __hwloc_restrict. Risking conflicts on the definition of restrict just for that seems too much to me.

> >> So I'd call it a "feature" if hwloc defines "restrict" to one thing
> >> and then some other header file defines it to another. :-)
> >
> >?
> >That would make applications get a warning just because they are, for
> >instance, using at the same time two libraries which both define restrict
> >to different things.
>
> Yes -- and that's a Bad Thing. The differences between those two libraries should be resolved, lest unexpected things occur because of a mismatch between what the header exports (and might be redefined) and what the compiled back-end library expects, no?

You may not be able to resolve the difference: depending on the kind of compiler detection etc., you may end up with restrict defined to __restrict or to something else. And as soon as one improves its detection of the compiler, the other(s!) will have to harmonize, etc. Not really sustainable.

Samuel
Re: [OMPI devel] MPIR_Breakpoint visibility
Hi, According to our TotalView person, the PGI and Intel versions of OMPI 1.3.3 are not working properly. She noted that 1.2.8 and 1.3.2 work fine. Thanks, Samuel K. Gutierrez

On Sep 21, 2009, at 7:19 AM, Terry Dontje wrote:

Ralph Castain wrote: I see it declared "extern" in orte/tools/orterun/debuggers.h, but not DECLSPEC'd. FWIW: LANL uses Intel compilers + TotalView on a regular basis, and I have yet to hear of an issue.

It actually will work if you attach to the job or if you are not relying on MPIR_Breakpoint to actually stop execution. --td

On Sep 21, 2009, at 7:03 AM, Terry Dontje wrote:

I was kind of amazed no one else managed to run into this, but it was brought to my attention that, compiling OMPI with Intel compilers and visibility enabled, the MPIR_Breakpoint symbol was not being exposed. I am assuming this is due to MPIR_Breakpoint not being ORTE_ or OMPI_DECLSPEC'd. Do others agree, or am I missing something obvious here?

Interestingly enough, it doesn't look like the gcc, PGI, PathScale or Sun compilers hide the MPIR_Breakpoint symbol. --td

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] detecting regcache_clean deadlocks in Open-MX
Jeff Squyres wrote:
> Do you just want to wait for the ummunotify stuff in OMPI? I'm half
> done making a merged "linux" memory component (i.e., it merges the
> ptmalloc2 component with the new ummunotify stuff).
>
> It won't help for kernels <2.6.32, of course. :-)

Yeah, that's another solution for bleeding-edge distribs :) But it's not even merged in 2.6.32 yet, and I hope the perfcounter guys won't successfully convince Roland to reimplement ummunotify as a perfcounter feature. The 2.6.32 merge window is closing soon...

Brice
Re: [OMPI devel] detecting regcache_clean deadlocks in Open-MX
Do you just want to wait for the ummunotify stuff in OMPI? I'm half done making a merged "linux" memory component (i.e., it merges the ptmalloc2 component with the new ummunotify stuff).

It won't help for kernels <2.6.32, of course. :-)

On Sep 21, 2009, at 9:11 AM, Brice Goglin wrote:

Jeff Squyres wrote:
> On Sep 21, 2009, at 5:50 AM, Brice Goglin wrote:
>
>> I am playing with mx__regcache_clean() in Open-MX so as to have OpenMPI
>> cleanup the Open-MX regcache when needed. It causes some deadlocks since
>> OpenMPI intercepts Open-MX' own free() calls. Is there a "safe" way to
>> have Open-MX free/munmap calls not invoke OpenMPI interception hooks?
>
> Not ATM, no.
>
>> Or is there a way to detect the caller of free/munmap so that my
>> regcache_clean does nothing in this case? Otherwise, I guess I'll have
>> to add a private malloc implementation inside Open-MX and hope OpenMPI
>> won't see it.
>
> Can you structure your code to not call free/munmap inside the handler?

The first problem is actually about thread-safety. Most Open-MX functions, including mx_regcache_clean(), take a pthread mutex, so I would have to move all free/munmap calls outside of the locked section. That's probably feasible but requires a lot of work :)

Brice

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] MPIR_Breakpoint visibility
Ralph Castain wrote: I see it declared "extern" in orte/tools/orterun/debuggers.h, but not DECLSPEC'd. FWIW: LANL uses Intel compilers + TotalView on a regular basis, and I have yet to hear of an issue.

It actually will work if you attach to the job or if you are not relying on MPIR_Breakpoint to actually stop execution. --td

On Sep 21, 2009, at 7:03 AM, Terry Dontje wrote:

I was kind of amazed no one else managed to run into this, but it was brought to my attention that, compiling OMPI with Intel compilers and visibility enabled, the MPIR_Breakpoint symbol was not being exposed. I am assuming this is due to MPIR_Breakpoint not being ORTE_ or OMPI_DECLSPEC'd. Do others agree, or am I missing something obvious here?

Interestingly enough, it doesn't look like the gcc, PGI, PathScale or Sun compilers hide the MPIR_Breakpoint symbol. --td

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] MPIR_Breakpoint visibility
Does declspec matter for executables? (I don't recall)

On Sep 21, 2009, at 9:15 AM, Ralph Castain wrote:

I see it declared "extern" in orte/tools/orterun/debuggers.h, but not DECLSPEC'd. FWIW: LANL uses Intel compilers + TotalView on a regular basis, and I have yet to hear of an issue.

On Sep 21, 2009, at 7:03 AM, Terry Dontje wrote:

> I was kind of amazed no one else managed to run into this, but it was brought to my attention that, compiling OMPI with Intel compilers and visibility enabled, the MPIR_Breakpoint symbol was not being exposed. I am assuming this is due to MPIR_Breakpoint not being ORTE_ or OMPI_DECLSPEC'd. Do others agree, or am I missing something obvious here?
>
> Interestingly enough, it doesn't look like the gcc, PGI, PathScale or Sun compilers hide the MPIR_Breakpoint symbol.
> --td

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] MPIR_Breakpoint visibility
I see it declared "extern" in orte/tools/orterun/debuggers.h, but not DECLSPEC'd. FWIW: LANL uses Intel compilers + TotalView on a regular basis, and I have yet to hear of an issue.

On Sep 21, 2009, at 7:03 AM, Terry Dontje wrote:

I was kind of amazed no one else managed to run into this, but it was brought to my attention that, compiling OMPI with Intel compilers and visibility enabled, the MPIR_Breakpoint symbol was not being exposed. I am assuming this is due to MPIR_Breakpoint not being ORTE_ or OMPI_DECLSPEC'd. Do others agree, or am I missing something obvious here?

Interestingly enough, it doesn't look like the gcc, PGI, PathScale or Sun compilers hide the MPIR_Breakpoint symbol. --td

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [hwloc-devel] last API possible changes
On Mon, 21 Sep 2009 at 08:51:35 -0400, Jeff Squyres wrote:
> On Sep 21, 2009, at 8:44 AM, Samuel Thibault wrote:
>
> >> FWIW, is there a reason we're not using AC_C_RESTRICT in
> >> configure.ac? This allows you to use "restrict" in C code everywhere;
> >> it'll be #defined to something acceptable by the compiler if
> >> "restrict" itself is not.
> >
> >Our __hwloc_restrict macro is actually a copy/paste of AC_C_RESTRICT's
> >tinkering.

Ah, wait, no, I was confusing it with something else in another project. Looking closer, this definition is mine. Note, by the way, that the autoconf license makes an exception for code output from autoconf scripts: the GPL doesn't apply to it, and there is “unlimited permission to copy, distribute, and modify” it.

> >The problem is precisely that AC_C_RESTRICT provides "restrict",
> >and not another keyword, so that using it in installed headers
> >may conflict with other headers' tinkering about restrict. Yes,
> >configure is meant to detect such kinds of things, but it cannot
> >know which variety of headers the user will want to include from
> >their application, and any of them could want to define restrict to
> >something else.
>
> Would it ever be sane to use one value of restrict in hwloc and a
> different value in an upper-level application?

That's not a problem, since restrict is just an optimization and validity-checking qualifier.

Samuel
Re: [OMPI devel] detecting regcache_clean deadlocks in Open-MX
Jeff Squyres wrote:
> On Sep 21, 2009, at 5:50 AM, Brice Goglin wrote:
>
>> I am playing with mx__regcache_clean() in Open-MX so as to have OpenMPI
>> cleanup the Open-MX regcache when needed. It causes some deadlocks since
>> OpenMPI intercepts Open-MX' own free() calls. Is there a "safe" way to
>> have Open-MX free/munmap calls not invoke OpenMPI interception hooks?
>
> Not ATM, no.
>
>> Or is there a way to detect the caller of free/munmap so that my
>> regcache_clean does nothing in this case? Otherwise, I guess I'll have
>> to add a private malloc implementation inside Open-MX and hope OpenMPI
>> won't see it.
>
> Can you structure your code to not call free/munmap inside the handler?

The first problem is actually about thread-safety. Most Open-MX functions, including mx_regcache_clean(), take a pthread mutex, so I would have to move all free/munmap calls outside of the locked section. That's probably feasible but requires a lot of work :)

Brice
[OMPI devel] MPIR_Breakpoint visibility
I was kind of amazed no one else managed to run into this, but it was brought to my attention that, compiling OMPI with Intel compilers and visibility enabled, the MPIR_Breakpoint symbol was not being exposed. I am assuming this is due to MPIR_Breakpoint not being ORTE_ or OMPI_DECLSPEC'd. Do others agree, or am I missing something obvious here?

Interestingly enough, it doesn't look like the gcc, PGI, PathScale or Sun compilers hide the MPIR_Breakpoint symbol.

--td
Re: [hwloc-devel] last API possible changes
On Sep 21, 2009, at 8:44 AM, Samuel Thibault wrote:

> > FWIW, is there a reason we're not using AC_C_RESTRICT in configure.ac? This allows you to use "restrict" in C code everywhere; it'll be #defined to something acceptable by the compiler if "restrict" itself is not.
>
> Our __hwloc_restrict macro is actually a copy/paste of AC_C_RESTRICT's tinkering.

Er... that's a license violation, right? Doesn't that taint hwloc into making it GPL? (Good thing we haven't released yet!) :-(

> The problem is precisely that AC_C_RESTRICT provides "restrict", and not another keyword, so that using it in installed headers may conflict with other headers' tinkering about restrict. Yes, configure is meant to detect such kinds of things, but it cannot know which variety of headers the user will want to include from their application, and any of them could want to define restrict to something else.

Would it ever be sane to use one value of restrict in hwloc and a different value in an upper-level application?

-- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] Dynamic languages, dlopen() issues, and symbol visibility of libtool ltdl API in current trunk
Ick; I appreciate Lisandro's quandary, but don't quite know what to do.

How about keeping libltdl fvisibility=hidden inside mpi4py?

On Sep 17, 2009, at 11:16 AM, Josh Hursey wrote:

So I started down this road a couple months ago. I was using lt_dlopen() and friends in the OPAL CRS self module. The visibility changes broke that functionality. The one solution that I started implementing was precisely what you suggested: wrapping a subset of the libtool calls and prefixing them with opal_*. The email thread is below:
http://www.open-mpi.org/community/lists/devel/2009/07/6531.php

The problem that I hit was that libtool's build system did not play well with the visibility symbols. This caused dlopen to be disabled incorrectly. The libtool folks have a patch and, I believe, they are planning on incorporating it in the next release. The email thread is below:
http://thread.gmane.org/gmane.comp.gnu.libtool.patches/9446

So we would (others can speak up if not) certainly consider such a wrapper, but I think we need to wait for the next libtool release (unless there is other magic we can do) before it would be usable.

Do others have any other ideas on how we might get around this in the mean time?

-- Josh

On Sep 16, 2009, at 5:59 PM, Lisandro Dalcin wrote:

> Hi all. I have to contact you again about the issues related to dlopen()ing libmpi with RTLD_LOCAL, as many dynamic languages (Python in my case) do.
>
> So far, I've been able to manage the issues (despite the "do nothing" policy from Open MPI devs, which I understand) in a more or less portable manner by taking advantage of the availability of libtool ltdl symbols in the Open MPI libraries (specifically, in libopen-pal). For reference, all this hackery is here:
> http://code.google.com/p/mpi4py/source/browse/trunk/src/compat/openmpi.h
>
> However, I noticed that in the current trunk (v1.4, IIUC) things have changed and the libtool symbols are not externally available. Again, I understand the reason and acknowledge that such a change is a really good thing. However, it has broken all my hackery for dlopen()ing libmpi before the call to MPI_Init().
>
> Is there any chance that libopen-pal could provide some properly prefixed (let's say, using "opal_" as a prefix) wrapper calls to a small subset of the libtool ltdl API? The following set of wrapper calls would be the minimum required to properly load libmpi in a portable manner and clean up resources (let me abuse my previous suggestion and add the opal_ prefix):
>
> opal_lt_dlinit()
> opal_lt_dlexit()
>
> opal_lt_dladvise_init(a)
> opal_lt_dladvise_destroy(a)
> opal_lt_dladvise_global(a)
> opal_lt_dladvise_ext(a)
>
> opal_lt_dlopenadvise(n,a)
> opal_lt_dlclose(h)
>
> Any chance this request could be considered? I would really like to have this before any Open MPI tarball gets released without the libtool symbols exposed...
>
> --
> Lisandro Dalcín
> ---
> Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
> Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
> Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
> PTLC - Güemes 3450, (3000) Santa Fe, Argentina
> Tel/Fax: +54-(0)342-451.1594

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- Jeff Squyres jsquy...@cisco.com
Re: [hwloc-devel] last API possible changes
On Mon, 21 Sep 2009 at 08:22:06 -0400, Jeff Squyres wrote:
> FWIW, is there a reason we're not using AC_C_RESTRICT in
> configure.ac? This allows you to use "restrict" in C code everywhere;
> it'll be #defined to something acceptable by the compiler if
> "restrict" itself is not.

Our __hwloc_restrict macro is actually a copy/paste of AC_C_RESTRICT's tinkering. The problem is precisely that AC_C_RESTRICT provides "restrict", and not another keyword, so that using it in installed headers may conflict with other headers' tinkering about restrict. Yes, configure is meant to detect such kinds of things, but it cannot know which variety of headers the user will want to include from their application, and any of them could want to define restrict to something else.

Samuel
Re: [OMPI devel] detecting regcache_clean deadlocks in Open-MX
On Sep 21, 2009, at 5:50 AM, Brice Goglin wrote:

> I am playing with mx__regcache_clean() in Open-MX so as to have OpenMPI
> cleanup the Open-MX regcache when needed. It causes some deadlocks since
> OpenMPI intercepts Open-MX' own free() calls. Is there a "safe" way to
> have Open-MX free/munmap calls not invoke OpenMPI interception hooks?

Not ATM, no.

> Or is there a way to detect the caller of free/munmap so that my
> regcache_clean does nothing in this case? Otherwise, I guess I'll have
> to add a private malloc implementation inside Open-MX and hope OpenMPI
> won't see it.

Can you structure your code to not call free/munmap inside the handler?

-- Jeff Squyres jsquy...@cisco.com
Re: [hwloc-devel] last API possible changes
On Sep 20, 2009, at 6:12 PM, Samuel Thibault wrote:

> Also, we have __hwloc_restrict everywhere in the public API, but also in
> the manpages? Should we convert the latter into a regular "restrict"
> keyword?

I had tried before already through the .cfg, and that didn't work. Since we now have our own Makefile rules, I've just added a sed call, which also does the same for inline, btw.

FWIW, is there a reason we're not using AC_C_RESTRICT in configure.ac? This allows you to use "restrict" in C code everywhere; it'll be #defined to something acceptable by the compiler if "restrict" itself is not.

-- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] Deadlock with comm_create since cid allocator change
You were faster to fix the bug than I was to send my bug report :-) So I confirm: this fixes the problem. Thanks!

Sylvain

On Mon, 21 Sep 2009, Edgar Gabriel wrote:

What version of OpenMPI did you use? Patch #21970 should have fixed this issue on the trunk... Thanks, Edgar

Sylvain Jeaugey wrote:

Hi list,

We are currently experiencing deadlocks when using communicators other than MPI_COMM_WORLD, so we made a very simple reproducer (Comm_create, then MPI_Barrier on the communicator -- see end of e-mail).

We can reproduce the deadlock only with openib and with at least 8 cores (no success with sm), and after ~20 runs on average. Using a larger number of cores greatly increases the occurrence of the deadlock. When the deadlock occurs, every even process is stuck in MPI_Finalize and every odd process is in MPI_Barrier.

So we tracked the bug in the changesets and found out that this patch seems to have introduced it:

user: brbarret
date: Tue Aug 25 15:13:31 2009 +
summary: Per discussion in ticket #2009, temporarily disable the block CID allocation algorithms until they properly reuse CIDs.

Reverting to the non-multi-thread cid allocator makes the deadlock disappear. I tried to dig further and understand why this makes a difference, with no luck. If anyone can figure out what's happening, that would be great...

Thanks,
Sylvain

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, numTasks;
    int range[3];
    MPI_Comm testComm, dupComm;
    MPI_Group orig_group, new_group;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numTasks);
    MPI_Comm_group(MPI_COMM_WORLD, &orig_group);

    range[0] = 0;            /* first rank */
    range[1] = numTasks - 1; /* last rank */
    range[2] = 1;            /* stride */
    MPI_Group_range_incl(orig_group, 1, &range, &new_group);
    MPI_Comm_create(MPI_COMM_WORLD, new_group, &testComm);

    MPI_Barrier(testComm);
    MPI_Finalize();
    return 0;
}

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
Re: [OMPI devel] Deadlock with comm_create since cid allocator change
What version of Open MPI did you use? Patch #21970 should have fixed this issue on the trunk...

Thanks
Edgar

Sylvain Jeaugey wrote:

Hi list,

We are currently experiencing deadlocks when using communicators other than MPI_COMM_WORLD, so we made a very simple reproducer (Comm_create, then MPI_Barrier on the new communicator - see end of e-mail). We can reproduce the deadlock only with openib and with at least 8 cores (no success with sm), after ~20 runs on average. Using a larger number of cores greatly increases the occurrence of the deadlock. When the deadlock occurs, every even process is stuck in MPI_Finalize and every odd process is in MPI_Barrier.

So we tracked the bug through the changesets and found that this patch seems to have introduced it:

user: brbarret
date: Tue Aug 25 15:13:31 2009
summary: Per discussion in ticket #2009, temporarily disable the block CID allocation algorithms until they properly reuse CIDs.

Reverting to the non-multi-threaded cid allocator makes the deadlock disappear. I tried to dig further and understand why this makes a difference, with no luck. If anyone can figure out what's happening, that would be great...

Thanks,
Sylvain

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, numTasks;
    int range[3];
    MPI_Comm testComm, dupComm;
    MPI_Group orig_group, new_group;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numTasks);
    MPI_Comm_group(MPI_COMM_WORLD, &orig_group);
    range[0] = 0;            /* first rank */
    range[1] = numTasks - 1; /* last rank */
    range[2] = 1;            /* stride */
    MPI_Group_range_incl(orig_group, 1, &range, &new_group);
    MPI_Comm_create(MPI_COMM_WORLD, new_group, &testComm);
    MPI_Barrier(testComm);
    MPI_Finalize();
    return 0;
}

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science, University of Houston
Philip G. Hoffman Hall, Room 524, Houston, TX 77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
[OMPI devel] Deadlock with comm_create since cid allocator change
Hi list,

We are currently experiencing deadlocks when using communicators other than MPI_COMM_WORLD, so we made a very simple reproducer (Comm_create, then MPI_Barrier on the new communicator - see end of e-mail). We can reproduce the deadlock only with openib and with at least 8 cores (no success with sm), after ~20 runs on average. Using a larger number of cores greatly increases the occurrence of the deadlock. When the deadlock occurs, every even process is stuck in MPI_Finalize and every odd process is in MPI_Barrier.

So we tracked the bug through the changesets and found that this patch seems to have introduced it:

user: brbarret
date: Tue Aug 25 15:13:31 2009
summary: Per discussion in ticket #2009, temporarily disable the block CID allocation algorithms until they properly reuse CIDs.

Reverting to the non-multi-threaded cid allocator makes the deadlock disappear. I tried to dig further and understand why this makes a difference, with no luck. If anyone can figure out what's happening, that would be great...

Thanks,
Sylvain

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, numTasks;
    int range[3];
    MPI_Comm testComm, dupComm;
    MPI_Group orig_group, new_group;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numTasks);
    MPI_Comm_group(MPI_COMM_WORLD, &orig_group);
    range[0] = 0;            /* first rank */
    range[1] = numTasks - 1; /* last rank */
    range[2] = 1;            /* stride */
    MPI_Group_range_incl(orig_group, 1, &range, &new_group);
    MPI_Comm_create(MPI_COMM_WORLD, new_group, &testComm);
    MPI_Barrier(testComm);
    MPI_Finalize();
    return 0;
}
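[Editor's note: since the deadlock is intermittent (~20 runs on average), one way to hunt it is to run the reproducer in a loop under a timeout. This is an illustrative sketch, not part of the original post; the function name and the ./cid_deadlock binary name are assumptions.]

```shell
# Hypothetical harness: run a command repeatedly under a timeout and
# report the first run that fails to finish, which for this reproducer
# usually indicates the deadlock.
hunt_deadlock() {
    runs=$1; secs=$2; shift 2
    i=1
    while [ "$i" -le "$runs" ]; do
        if ! timeout "$secs" "$@"; then
            echo "run $i did not complete"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no hang in $runs runs"
    return 0
}

# Example (assumes the reproducer above was built as ./cid_deadlock):
#   hunt_deadlock 100 60 mpirun -np 8 ./cid_deadlock
```

`timeout` is GNU coreutils; it kills the launch after the given number of seconds, so a hung run does not block the whole sweep.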
[OMPI devel] detecting regcache_clean deadlocks in Open-MX
Hello,

I am playing with mx__regcache_clean() in Open-MX so as to have Open MPI clean up the Open-MX regcache when needed. It causes some deadlocks, since Open MPI intercepts Open-MX's own free() calls. Is there a "safe" way to have Open-MX free/munmap calls not invoke Open MPI's interception hooks? Or is there a way to detect the caller of free/munmap so that my regcache_clean does nothing in that case? Otherwise, I guess I'll have to add a private malloc implementation inside Open-MX and hope Open MPI won't see it.

thanks,
Brice
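[Editor's note: one common answer to "detect the caller" is a thread-local re-entrancy guard set around the library's own free/munmap calls, so the cleanup hook can bail out when invoked from inside the cache. The sketch below is illustrative only; none of these names (in_regcache, regcache_clean_hook, guarded_free) are actual Open-MX or Open MPI APIs.]

```c
#include <stdlib.h>

/* Thread-local flag: nonzero while regcache-internal code is running.
 * (__thread is the GCC spelling; C11 offers _Thread_local.) */
static __thread int in_regcache = 0;

static int hook_calls = 0;  /* demo counter: how often the hook really ran */

/* Stand-in for the cleanup hook that the interception layer would call. */
static void regcache_clean_hook(void *ptr)
{
    (void)ptr;
    if (in_regcache)
        return;       /* free() came from regcache internals: skip cleanup */
    hook_calls++;     /* free() came from the application: clean the cache */
}

/* free() as the hypothetical interception layer would see it. */
static void guarded_free(void *ptr)
{
    regcache_clean_hook(ptr);
    free(ptr);
}

/* A regcache-internal release: sets the guard around its own free call,
 * so the hook above detects the re-entrant case and does nothing. */
static void regcache_internal_release(void *ptr)
{
    in_regcache = 1;
    guarded_free(ptr);
    in_regcache = 0;
}
```

The guard must be thread-local (not a plain global) so that one thread freeing inside the cache does not suppress cleanup for application frees on other threads.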