Re: [OMPI devel] FreeBSD timer_base_open error?
On Mar 25, 2008, at 6:16 PM, Jeff Squyres wrote:

> "linux" is the name of the component. It looks like opal/mca/timer/linux/timer_linux_component.c is doing some checks during component open() and returning an error if it can't be used (e.g., if it's not on linux). The timer components are a little different than normal MCA frameworks; they *must* be compiled in libopen-pal statically, and there will only be one of them built.
>
> In this case, I'm guessing that linux was built simply because nothing else was selected to be built, but then its component_open() function failed because it didn't find /proc/cpuinfo.

This is actually incorrect. The linux component looks for /proc/cpuinfo and builds if it finds that file. There's a base component that's built if nothing else is found.

The configure logic for the linux component is probably not the right thing to do -- it should probably be modified to check both that the file is readable (there are systems that call themselves "linux" but don't have a /proc/cpuinfo) and that we're actually on Linux.

Brian

--
Brian Barrett

There is an art . . . to flying. The knack lies in learning how to throw yourself at the ground and miss.
    Douglas Adams, 'The Hitchhikers Guide to the Galaxy'
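As a rough illustration of the two-part check described above (the system really is Linux, and /proc/cpuinfo is readable), here is a minimal C sketch. It is a runtime analogue only, not the configure logic itself, and the names sketch_linux_timer_usable and sketch_probe_linux_timer are hypothetical, not Open MPI symbols.

```c
#include <assert.h>
#include <string.h>
#include <sys/utsname.h>
#include <unistd.h>

/* Hypothetical helper, not Open MPI code: the linux timer is treated as
 * usable only when the OS name is "Linux" AND /proc/cpuinfo is readable. */
static int sketch_linux_timer_usable(const char *sysname, int cpuinfo_readable)
{
    assert(sysname != NULL);
    return strcmp(sysname, "Linux") == 0 && cpuinfo_readable;
}

/* Probe the running system and apply the two-part check. */
static int sketch_probe_linux_timer(void)
{
    struct utsname u;

    if (uname(&u) != 0) {
        return 0;   /* cannot identify the OS: treat as unusable */
    }
    /* access() with R_OK tests readability without opening the file */
    int readable = (access("/proc/cpuinfo", R_OK) == 0);
    return sketch_linux_timer_usable(u.sysname, readable);
}
```

Keeping the decision in a pure function separate from the probing makes both failure cases (wrong OS, missing file) easy to exercise without a FreeBSD box at hand.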
Re: [OMPI devel] FreeBSD timer_base_open error?
"linux" is the name of the component. It looks like opal/mca/timer/ linux/timer_linux_component.c is doing some checks during component open() and returning an error if it can't be used (e.g,. if it's not on linux). The timer components are a little different than normal MCA frameworks; they *must* be compiled in libopen-pal statically, and there will only be one of them built. In this case, I'm guessing that linux was built simply because nothing else was selected to be built, but then its component_open() function failed because it didn't find /proc/cpuinfo. Do you have the inclination to write a quickie BSD-capable timer component? There's only 3 functions to write: - get the cycle count - get the current usec time - get the CPU frequency On Mar 23, 2008, at 12:54 AM, Brad Penoff wrote: Greetings, In an MTT run just now, I'm noticing these funny output messages in the middle of an early phase: MPI install [mpi install: gcc warnings] Installing MPI: [ompi-nightly-trunk] / [1.3a1r17921] / [gcc warnings]... [pc23.netbed.icics.ubc.ca:59263] mca: base: components_open: component timer / linux open function failed [pc23.netbed.icics.ubc.ca:59264] mca: base: components_open: component timer / linux open function failed [pc23.netbed.icics.ubc.ca:59265] mca: base: components_open: component timer / linux open function failed [pc23.netbed.icics.ubc.ca:59266] mca: base: components_open: component timer / linux open function failed Has anyone ever seen these? Nonetheless, my tests proceed and run after this. This is on a FreeBSD 6.2 system, so having "linux" in the message seems interesting... any suggestions? Thanks! brad ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Cisco Systems
Re: [OMPI devel] Proc modex change
Sounds reasonable to me. Anyone else care?

On Mar 20, 2008, at 2:03 PM, Brian W. Barrett wrote:

> Hi all -
>
> Does anyone know why we go through the modex receive for the local process in ompi_proc_get_info()? It doesn't seem like it's necessary, and it causes some problems on platforms that don't implement the modex (since it zeros out useful information determined during the init step).
>
> If no one has any objections, I'd like to commit the attached patch that fixes that problem.
>
> Thanks,
>
> Brian
> <proc_local_no_modex.diff>

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] Using coverity results
I put up some notes about using the Coverity web tool on the wiki (there's a link from the front page, too):

https://svn.open-mpi.org/trac/ompi/wiki/Coverity

I also asked Coverity if they could install libibverbs so that we can get scanning of the openib BTL. Let's see what they say; if they can do it, they may be able to install other support libraries as well.

On Mar 25, 2008, at 4:53 PM, Jeff Squyres wrote:

> I have started checking the Coverity results again. There's a lot of good stuff in there! But there are some false positives, as well.
>
> I *strongly* encourage developers to start checking the Coverity results; it takes a little time to dig through the issues that it finds and decide whether they're real or not. If we each take just a few a day, we could be in pretty good shape for v1.3.
>
> --
> Jeff Squyres
> Cisco Systems

--
Jeff Squyres
Cisco Systems
[OMPI devel] 1.2.6 testing
I mentioned on the call today that there were some bsend errors showing up in MTT on the v1.2 branch. I am now pretty sure that these are some weird artifact of my test environment. I cannot reproduce these errors manually.

I did some sanity testing on 1.2.6rc3 today and it looks good to me.

--
Jeff Squyres
Cisco Systems
[OMPI devel] Using coverity results
I have started checking the Coverity results again. There's a lot of good stuff in there! But there are some false positives, as well.

I *strongly* encourage developers to start checking the Coverity results; it takes a little time to dig through the issues that it finds and decide whether they're real or not. If we each take just a few a day, we could be in pretty good shape for v1.3.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] iof/libevent failures?
Further discussion on: https://svn.open-mpi.org/trac/ompi/ticket/1253

On Mar 25, 2008, at 8:54 AM, Ralph H Castain wrote:

> > I cannot replicate this with a debug build
>
> I was doing all my work in a debug build, Tim - may be why I didn't see the problem.
>
> There is an issue with libevent right now and pty's/select/event registration. George, Jeff, and I have been chatting about it as it breaks the Mac completely. Will let them take it from there...
>
> On 3/25/08 6:49 AM, "Tim Prins" wrote:
>
> > Hi everyone,
> >
> > For the last couple nights ALL of our mtt runs have been failing (although the failure is masked because mpirun is returning the wrong error code) with:
> >
> > [odin005.cs.indiana.edu:28167] [[46567,0],0] ORTE_ERROR_LOG: Error in file base/plm_base_launch_support.c at line 161
> > --
> > mpirun was unable to start the specified application as it encountered an error. More information may be available above.
> > --
> >
> > This line is where we try to do an IOF push. It looks like it was broken somewhere between r17922 and r17926, which includes the libevent merge.
> >
> > I cannot replicate this with a debug build, so I thought I would throw this out before I look any further.
> >
> > Thanks,
> >
> > Tim

--
Jeff Squyres
Cisco Systems
[OMPI devel] Open MPI v1.2.6rc3 has been posted
Hi All,

The next release candidate (rc3) of Open MPI v1.2.6 is now up:

http://www.open-mpi.org/software/ompi/v1.2/

Please run it through its paces as best you can.

--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r17951
Note: you do *not* need to re-autogen for this commit.

On Mar 25, 2008, at 9:41 AM, gship...@osl.iu.edu wrote:

> Author: gshipman
> Date: 2008-03-25 09:41:09 EDT (Tue, 25 Mar 2008)
> New Revision: 17951
> URL: https://svn.open-mpi.org/trac/ompi/changeset/17951
>
> Log:
> need orted_LDFLAGS as a placeholder
> you will need to re autogen.sh
>
> Text files modified:
>    trunk/orte/tools/orted/Makefile.am | 9 -
>    1 files changed, 8 insertions(+), 1 deletions(-)
>
> Modified: trunk/orte/tools/orted/Makefile.am
> ==============================================================================
> --- trunk/orte/tools/orted/Makefile.am (original)
> +++ trunk/orte/tools/orted/Makefile.am 2008-03-25 09:41:09 EDT (Tue, 25 Mar 2008)
> @@ -25,5 +25,12 @@
>  endif # OMPI_INSTALL_BINARIES
>
> -orted_SOURCES = orted.c
> +orted_SOURCES = orted.c
> +# the following empty orted_LDFLAGS is used
> +# so that the orted can be compiled statically
> +# by simply changing the value of this from
> +# nothing to -all-static in the Makefile.in
> +# nice for systems that don't have all the shared
> +# libraries on the computes
> +orted_LDFLAGS =
>
>  orted_LDADD = $(top_builddir)/orte/libopen-rte.la
> ___
> svn-full mailing list
> svn-f...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/svn-full

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] iof/libevent failures?
Crud, ok. I added this info to https://svn.open-mpi.org/trac/ompi/ticket/1253; hopefully we'll resolve it today.

I guess people didn't test the libevent-merge branch before we brought it to the trunk. :-(

On Mar 25, 2008, at 9:22 AM, Tim Prins wrote:

> I was able to replicate the failure with a debug build by running mpirun through a batch job. I then added the parameter you gave me, and it worked fine with the parameter.
>
> Thanks,
>
> Tim
>
> Jeff Squyres wrote:
>
> > We're chasing down a problem that we're having on OSX w.r.t. libevent, too -- can you try running with:
> >
> >     --mca opal_event_include select
> >
> > and see if that fixes the problem for you?
> >
> > On Mar 25, 2008, at 8:49 AM, Tim Prins wrote:
> >
> > > Hi everyone,
> > >
> > > For the last couple nights ALL of our mtt runs have been failing (although the failure is masked because mpirun is returning the wrong error code) with:
> > >
> > > [odin005.cs.indiana.edu:28167] [[46567,0],0] ORTE_ERROR_LOG: Error in file base/plm_base_launch_support.c at line 161
> > > --
> > > mpirun was unable to start the specified application as it encountered an error. More information may be available above.
> > > --
> > >
> > > This line is where we try to do an IOF push. It looks like it was broken somewhere between r17922 and r17926, which includes the libevent merge.
> > >
> > > I cannot replicate this with a debug build, so I thought I would throw this out before I look any further.
> > >
> > > Thanks,
> > >
> > > Tim

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] iof/libevent failures?
I was able to replicate the failure with a debug build by running mpirun through a batch job. I then added the parameter you gave me, and it worked fine with the parameter.

Thanks,

Tim

Jeff Squyres wrote:

> We're chasing down a problem that we're having on OSX w.r.t. libevent, too -- can you try running with:
>
>     --mca opal_event_include select
>
> and see if that fixes the problem for you?
>
> On Mar 25, 2008, at 8:49 AM, Tim Prins wrote:
>
> > Hi everyone,
> >
> > For the last couple nights ALL of our mtt runs have been failing (although the failure is masked because mpirun is returning the wrong error code) with:
> >
> > [odin005.cs.indiana.edu:28167] [[46567,0],0] ORTE_ERROR_LOG: Error in file base/plm_base_launch_support.c at line 161
> > --
> > mpirun was unable to start the specified application as it encountered an error. More information may be available above.
> > --
> >
> > This line is where we try to do an IOF push. It looks like it was broken somewhere between r17922 and r17926, which includes the libevent merge.
> >
> > I cannot replicate this with a debug build, so I thought I would throw this out before I look any further.
> >
> > Thanks,
> >
> > Tim
Re: [OMPI devel] iof/libevent failures?
We're chasing down a problem that we're having on OSX w.r.t. libevent, too -- can you try running with:

    --mca opal_event_include select

and see if that fixes the problem for you?

On Mar 25, 2008, at 8:49 AM, Tim Prins wrote:

> Hi everyone,
>
> For the last couple nights ALL of our mtt runs have been failing (although the failure is masked because mpirun is returning the wrong error code) with:
>
> [odin005.cs.indiana.edu:28167] [[46567,0],0] ORTE_ERROR_LOG: Error in file base/plm_base_launch_support.c at line 161
> --
> mpirun was unable to start the specified application as it encountered an error. More information may be available above.
> --
>
> This line is where we try to do an IOF push. It looks like it was broken somewhere between r17922 and r17926, which includes the libevent merge.
>
> I cannot replicate this with a debug build, so I thought I would throw this out before I look any further.
>
> Thanks,
>
> Tim

--
Jeff Squyres
Cisco Systems
[OMPI devel] iof/libevent failures?
Hi everyone,

For the last couple nights ALL of our mtt runs have been failing (although the failure is masked because mpirun is returning the wrong error code) with:

[odin005.cs.indiana.edu:28167] [[46567,0],0] ORTE_ERROR_LOG: Error in file base/plm_base_launch_support.c at line 161
--
mpirun was unable to start the specified application as it encountered an error. More information may be available above.
--

This line is where we try to do an IOF push. It looks like it was broken somewhere between r17922 and r17926, which includes the libevent merge.

I cannot replicate this with a debug build, so I thought I would throw this out before I look any further.

Thanks,

Tim
[OMPI devel] Return code and error message problems
Hi,

Something went wrong last night and all our MTT tests had the following output:

[odin005.cs.indiana.edu:28167] [[46567,0],0] ORTE_ERROR_LOG: Error in file base/plm_base_launch_support.c at line 161
--
mpirun was unable to start the specified application as it encountered an error. More information may be available above.
--

I have not tracked down what caused this, but the more immediate problem is that after giving this error mpirun returned '0' instead of a more sane error value.

Also, when running the test 'orte/test/mpi/abort' I get the error output:

--
mpirun has exited due to process rank 1 with PID 17822 on node odin013 exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
--

This is wrong; it should say that the process was aborted. It looks like somehow the job state is being set to ORTE_JOB_STATE_ABORTED_WO_SYNC instead of ORTE_JOB_STATE_ABORTED.

Thanks,

Tim
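The two expectations in this report (a failed launch must never exit with 0, and an aborted process should be reported as aborted rather than as "exiting without calling finalize") can be sketched in C as follows. This is only an illustration under stated assumptions: the SKETCH_JOB_* states and both functions are hypothetical stand-ins, not ORTE's real job-state values or mpirun's actual code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for job end states; names and values are
 * invented for this sketch and are not ORTE's. */
typedef enum {
    SKETCH_JOB_TERMINATED_OK,
    SKETCH_JOB_FAILED_TO_START,
    SKETCH_JOB_ABORTED,
    SKETCH_JOB_ABORTED_WO_SYNC
} sketch_job_state_t;

/* Exit status for the launcher: 0 only when the job ended cleanly. */
static int sketch_exit_status(sketch_job_state_t state, int app_status)
{
    assert(app_status >= 0);
    if (state == SKETCH_JOB_TERMINATED_OK) {
        return 0;
    }
    /* never report success for a failed job, even if app_status is 0 */
    return app_status != 0 ? app_status : 1;
}

/* Short human-readable reason matching the final job state. */
static const char *sketch_exit_reason(sketch_job_state_t state)
{
    switch (state) {
    case SKETCH_JOB_TERMINATED_OK:   return "completed normally";
    case SKETCH_JOB_FAILED_TO_START: return "unable to start the application";
    case SKETCH_JOB_ABORTED:         return "process aborted";
    case SKETCH_JOB_ABORTED_WO_SYNC: return "process exited without calling finalize";
    }
    return "unknown state";
}
```

With a mapping like this, the misreported abort described above would trace back to the wrong state being recorded, since the message simply follows whatever state was set.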