Re: [OMPI devel] FreeBSD timer_base_open error?

2008-03-25 Thread Brian Barrett

On Mar 25, 2008, at 6:16 PM, Jeff Squyres wrote:

"linux" is the name of the component.  It looks like opal/mca/timer/
linux/timer_linux_component.c is doing some checks during component
open() and returning an error if it can't be used (e.g., if it's not
on Linux).

The timer components are a little different than normal MCA
frameworks; they *must* be compiled in libopen-pal statically, and
there will only be one of them built.

In this case, I'm guessing that linux was built simply because nothing
else was selected to be built, but then its component_open() function
failed because it didn't find /proc/cpuinfo.



This is actually incorrect.  The linux component looks for
/proc/cpuinfo and builds if it finds that file.  There's a base
component that's built if nothing else is found.  The configure logic
for the linux component is probably not the right thing to do -- it
should probably be modified to check both that the file is readable
(there are systems that call themselves "linux" but don't have a
/proc/cpuinfo) and that we're actually on Linux.
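
To make the two conditions concrete, here is a rough C sketch of the
combined check.  It is only an illustration of the idea, not the
component's actual configure logic; the function name
timer_linux_usable is made up for this example.

```c
#include <string.h>
#include <sys/utsname.h>
#include <unistd.h>

/* Hypothetical helper (not part of Open MPI): returns 1 only if the
 * running kernel identifies itself as Linux AND the given cpuinfo
 * path is readable; returns 0 otherwise. */
int timer_linux_usable(const char *cpuinfo_path)
{
    struct utsname u;

    if (uname(&u) < 0) {
        return 0;                       /* cannot identify the OS */
    }
    if (strcmp(u.sysname, "Linux") != 0) {
        return 0;                       /* e.g. "FreeBSD" */
    }
    return access(cpuinfo_path, R_OK) == 0 ? 1 : 0;
}
```

Calling timer_linux_usable("/proc/cpuinfo") should return 0 on a
FreeBSD box regardless of whether a /proc/cpuinfo happens to exist.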


Brian

--
  Brian Barrett

  There is an art . . . to flying. The knack lies in learning how to
  throw yourself at the ground and miss.
  Douglas Adams, 'The Hitchhiker's Guide to the Galaxy'





Re: [OMPI devel] FreeBSD timer_base_open error?

2008-03-25 Thread Jeff Squyres
"linux" is the name of the component.  It looks like opal/mca/timer/ 
linux/timer_linux_component.c is doing some checks during component  
open() and returning an error if it can't be used (e.g., if it's not
on Linux).


The timer components are a little different than normal MCA  
frameworks; they *must* be compiled in libopen-pal statically, and  
there will only be one of them built.


In this case, I'm guessing that linux was built simply because nothing  
else was selected to be built, but then its component_open() function  
failed because it didn't find /proc/cpuinfo.


Do you have the inclination to write a quickie BSD-capable timer
component?  There are only three functions to write:


- get the cycle count
- get the current usec time
- get the CPU frequency



On Mar 23, 2008, at 12:54 AM, Brad Penoff wrote:

Greetings,

In an MTT run just now, I'm noticing these funny output messages in
the middle of an early phase:




MPI install [mpi install: gcc warnings]
  Installing MPI: [ompi-nightly-trunk] / [1.3a1r17921] / [gcc warnings]...

[pc23.netbed.icics.ubc.ca:59263] mca: base: components_open: component timer / linux open function failed
[pc23.netbed.icics.ubc.ca:59264] mca: base: components_open: component timer / linux open function failed
[pc23.netbed.icics.ubc.ca:59265] mca: base: components_open: component timer / linux open function failed
[pc23.netbed.icics.ubc.ca:59266] mca: base: components_open: component timer / linux open function failed



Has anyone ever seen these?

Nonetheless, my tests proceed and run after this.  This is on a
FreeBSD 6.2 system, so having "linux" in the message seems
interesting...  any suggestions?

Thanks!
brad
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Proc modex change

2008-03-25 Thread Jeff Squyres

Sounds reasonable to me.  Anyone else care?

On Mar 20, 2008, at 2:03 PM, Brian W. Barrett wrote:

Hi all -

Does anyone know why we go through the modex receive for the
local process in ompi_proc_get_info()?  It doesn't seem like it's
necessary, and it causes some problems on platforms that don't  
implement the modex (since it zeros out useful information  
determined during the init step).  If no one has any objections, I'd  
like to commit the attached patch that fixes that problem.
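
To illustrate the idea (this is not the attached patch, and none of
the names below come from the Open MPI source), here is a simplified
sketch.  It assumes a per-process record whose fields are set during
init for the local process and overwritten from the modex for everyone
else; proc_info_t, fill_proc_info and no_modex_recv are all
hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a process record. */
typedef struct {
    int  is_local;        /* nonzero for the calling process */
    char hostname[64];    /* set during init for the local process */
} proc_info_t;

/* Hypothetical modex lookup: returns 0 and fills 'out' on success,
 * nonzero if no data is available. */
typedef int (*modex_recv_fn)(const proc_info_t *proc, char *out, size_t len);

/* Stand-in for a platform that does not implement the modex:
 * it never has data, so 'out' is left untouched. */
int no_modex_recv(const proc_info_t *proc, char *out, size_t len)
{
    (void)proc; (void)out; (void)len;
    return -1;
}

/* Overwrite each remote record from the modex, but skip the local
 * process so the values determined during init are preserved. */
void fill_proc_info(proc_info_t *procs, size_t n, modex_recv_fn recv)
{
    for (size_t i = 0; i < n; ++i) {
        char buf[sizeof(procs[i].hostname)];

        if (procs[i].is_local) {
            continue;                   /* keep what init already set */
        }
        memset(buf, 0, sizeof(buf));    /* stays zeroed if recv fails */
        (void)recv(&procs[i], buf, sizeof(buf));
        memcpy(procs[i].hostname, buf, sizeof(buf));
    }
}
```

Without the is_local check, the local record would be zeroed on a
platform where the receive never returns data, which is the problem
described above.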



Thanks,

Brian
<proc_local_no_modex.diff>




--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Using coverity results

2008-03-25 Thread Jeff Squyres
I put up some notes about using the Coverity web tool on the wiki  
(there's a link from the front page, too):


https://svn.open-mpi.org/trac/ompi/wiki/Coverity

I also asked Coverity if they could install libibverbs so that we can  
get scanning of the openib BTL.  Let's see what they say; if they can  
do it, they may be able to install other support libraries as well.



On Mar 25, 2008, at 4:53 PM, Jeff Squyres wrote:

I have started checking the Coverity results again.  There's a lot of
good stuff in there!  But there's some false positives, as well.

I *strongly* encourage developers to start checking the coverity
results; it takes a little time to dig through the issues that it
finds and decide whether they're real or not.  If we each take just a
few a day, we could be in pretty good shape for v1.3.

--
Jeff Squyres
Cisco Systems




--
Jeff Squyres
Cisco Systems



[OMPI devel] 1.2.6 testing

2008-03-25 Thread Jeff Squyres
I mentioned on the call today that there were some bsend errors  
showing up in MTT on the v1.2 branch.


I am now pretty sure that these are some weird artifact of my test  
environment.  I cannot reproduce these errors manually.


I did some sanity testing on 1.2.6rc3 today and it looks good to me.

--
Jeff Squyres
Cisco Systems



[OMPI devel] Using coverity results

2008-03-25 Thread Jeff Squyres
I have started checking the Coverity results again.  There's a lot of  
good stuff in there!  But there's some false positives, as well.


I *strongly* encourage developers to start checking the coverity  
results; it takes a little time to dig through the issues that it  
finds and decide whether they're real or not.  If we each take just a  
few a day, we could be in pretty good shape for v1.3.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] iof/libevent failures?

2008-03-25 Thread Jeff Squyres

Further discussion on: https://svn.open-mpi.org/trac/ompi/ticket/1253


On Mar 25, 2008, at 8:54 AM, Ralph H Castain wrote:

I cannot replicate this with a debug build


I was doing all my work in a debug build, Tim - may be why I didn't see
the problem.

There is an issue with libevent right now and pty's/select/event
registration. George, Jeff, and I have been chatting about it as it
breaks the Mac completely.

Will let them take it from there...


On 3/25/08 6:49 AM, "Tim Prins" wrote:


Hi everyone,

For the last couple nights ALL of our mtt runs have been failing
(although the failure is masked because mpirun is returning the wrong
error code) with:

[odin005.cs.indiana.edu:28167] [[46567,0],0] ORTE_ERROR_LOG: Error in file
base/plm_base_launch_support.c at line 161
--
mpirun was unable to start the specified application as it encountered
an error.
More information may be available above.
--

This line is where we try to do an IOF push. It looks like it was broken
somewhere between r17922 and r17926, which includes the libevent merge.

I cannot replicate this with a debug build, so I thought I would throw
this out before I look any further.

Thanks,

Tim



--
Jeff Squyres
Cisco Systems



[OMPI devel] Open MPI v1.2.6rc3 has been posted

2008-03-25 Thread Tim Mattox
Hi All,
The next release candidate (rc3) of Open MPI v1.2.6 is now up:

 http://www.open-mpi.org/software/ompi/v1.2/

Please run it through its paces as best you can.
-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
 I'm a bright... http://www.the-brights.net/


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r17951

2008-03-25 Thread Jeff Squyres

Note: you do *not* need to re-autogen for this commit.

On Mar 25, 2008, at 9:41 AM, gship...@osl.iu.edu wrote:

Author: gshipman
Date: 2008-03-25 09:41:09 EDT (Tue, 25 Mar 2008)
New Revision: 17951
URL: https://svn.open-mpi.org/trac/ompi/changeset/17951

Log:
need orted_LDFLAGS as a placeholder
 you will need to re autogen.sh

Text files modified:
  trunk/orte/tools/orted/Makefile.am | 9 ++++++++-
  1 files changed, 8 insertions(+), 1 deletions(-)

Modified: trunk/orte/tools/orted/Makefile.am
==============================================================================
--- trunk/orte/tools/orted/Makefile.am  (original)
+++ trunk/orte/tools/orted/Makefile.am	2008-03-25 09:41:09 EDT (Tue, 25 Mar 2008)

@@ -25,5 +25,12 @@

endif # OMPI_INSTALL_BINARIES

-orted_SOURCES = orted.c
+orted_SOURCES = orted.c
+# the following empty orted_LDFLAGS is used
+#  so that the orted can be compiled statically
+#  by simply changing the value of this from
+#  nothing to -all-static in the Makefile.in
+#  nice for systems that don't have all the shared
+#  libraries on the computes
+orted_LDFLAGS =
orted_LDADD = $(top_builddir)/orte/libopen-rte.la
___
svn-full mailing list
svn-f...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn-full



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] iof/libevent failures?

2008-03-25 Thread Jeff Squyres
Crud, ok.  I added this info to
https://svn.open-mpi.org/trac/ompi/ticket/1253; hopefully we'll
resolve it today.


I guess people didn't test the libevent-merge branch before we brought  
it to the trunk.  :-(



On Mar 25, 2008, at 9:22 AM, Tim Prins wrote:
I was able to replicate the failure with a debug build by running mpirun
through a batch job. I then added the parameter you gave me, and it
worked fine with the parameter.

Thanks,

Tim

Jeff Squyres wrote:
We're chasing down a problem that we're having on OSX w.r.t. libevent,
too -- can you try running with:

   --mca opal_event_include select

and see if that fixes the problem for you?


On Mar 25, 2008, at 8:49 AM, Tim Prins wrote:

Hi everyone,

For the last couple nights ALL of our mtt runs have been failing
(although the failure is masked because mpirun is returning the wrong
error code) with:

[odin005.cs.indiana.edu:28167] [[46567,0],0] ORTE_ERROR_LOG: Error in file
base/plm_base_launch_support.c at line 161
--
mpirun was unable to start the specified application as it encountered
an error.
More information may be available above.
--

This line is where we try to do an IOF push. It looks like it was broken
somewhere between r17922 and r17926, which includes the libevent merge.

I cannot replicate this with a debug build, so I thought I would throw
this out before I look any further.

Thanks,

Tim



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] iof/libevent failures?

2008-03-25 Thread Tim Prins
I was able to replicate the failure with a debug build by running mpirun 
through a batch job. I then added the parameter you gave me, and it 
worked fine with the parameter.


Thanks,

Tim

Jeff Squyres wrote:
We're chasing down a problem that we're having on OSX w.r.t. libevent,  
too -- can you try running with:


--mca opal_event_include select

and see if that fixes the problem for you?


On Mar 25, 2008, at 8:49 AM, Tim Prins wrote:

Hi everyone,

For the last couple nights ALL of our mtt runs have been failing
(although the failure is masked because mpirun is returning the wrong
error code) with:

[odin005.cs.indiana.edu:28167] [[46567,0],0] ORTE_ERROR_LOG: Error in file
base/plm_base_launch_support.c at line 161
--
mpirun was unable to start the specified application as it encountered
an error.
More information may be available above.
--

This line is where we try to do an IOF push. It looks like it was broken
somewhere between r17922 and r17926, which includes the libevent merge.


I cannot replicate this with a debug build, so I thought I would throw
this out before I look any further.

Thanks,

Tim







Re: [OMPI devel] iof/libevent failures?

2008-03-25 Thread Jeff Squyres
We're chasing down a problem that we're having on OSX w.r.t. libevent,  
too -- can you try running with:


   --mca opal_event_include select

and see if that fixes the problem for you?


On Mar 25, 2008, at 8:49 AM, Tim Prins wrote:

Hi everyone,

For the last couple nights ALL of our mtt runs have been failing
(although the failure is masked because mpirun is returning the wrong
error code) with:

[odin005.cs.indiana.edu:28167] [[46567,0],0] ORTE_ERROR_LOG: Error in file
base/plm_base_launch_support.c at line 161
--
mpirun was unable to start the specified application as it encountered
an error.
More information may be available above.
--

This line is where we try to do an IOF push. It looks like it was broken
somewhere between r17922 and r17926, which includes the libevent merge.


I cannot replicate this with a debug build, so I thought I would throw
this out before I look any further.

Thanks,

Tim



--
Jeff Squyres
Cisco Systems



[OMPI devel] iof/libevent failures?

2008-03-25 Thread Tim Prins

Hi everyone,

For the last couple nights ALL of our mtt runs have been failing 
(although the failure is masked because mpirun is returning the wrong 
error code) with:


[odin005.cs.indiana.edu:28167] [[46567,0],0] ORTE_ERROR_LOG: Error in file
base/plm_base_launch_support.c at line 161
--
mpirun was unable to start the specified application as it encountered
an error.
More information may be available above.
--

This line is where we try to do an IOF push. It looks like it was broken 
somewhere between r17922 and r17926, which includes the libevent merge.


I cannot replicate this with a debug build, so I thought I would throw 
this out before I look any further.


Thanks,

Tim


[OMPI devel] Return code and error message problems

2008-03-25 Thread Tim Prins

Hi,

Something went wrong last night and all our MTT tests had the following 
output:

[odin005.cs.indiana.edu:28167] [[46567,0],0] ORTE_ERROR_LOG: Error in file
base/plm_base_launch_support.c at line 161
--
mpirun was unable to start the specified application as it encountered 
an error.

More information may be available above.
--

I have not tracked down what caused this, but the more immediate problem 
is that after giving this error mpirun returned '0' instead of a more 
sane error value.
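
For illustration only (this is not mpirun's actual code; the function
name and the convention that 0 means success are assumptions for this
sketch), one way to make sure an internal error can never turn into an
exit status of 0:

```c
/* Hypothetical helper: map an internal return code to a process exit
 * status.  Assumes rc == 0 means success and anything else is an error.
 * Exit statuses are effectively limited to 0..255, so an error code
 * whose low 8 bits are zero (e.g. -256) is mapped to 1 rather than
 * being allowed to look like success. */
int exit_status_from_rc(int rc)
{
    unsigned int mag;
    unsigned int low;

    if (rc == 0) {
        return 0;
    }
    /* Magnitude computed in unsigned arithmetic to avoid overflow. */
    mag = rc < 0 ? 0u - (unsigned int)rc : (unsigned int)rc;
    low = mag & 0xffu;
    return low != 0 ? (int)low : 1;
}
```

With something like this at the exit path, a failed launch would be
visible to MTT (or any calling script) through the exit status.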




Also, when running the test 'orte/test/mpi/abort' I get the error output:
--
mpirun has exited due to process rank 1 with PID 17822 on
node odin013 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--

This is wrong; it should be saying that the process was aborted. It
looks like somehow the job state is being set to
ORTE_JOB_STATE_ABORTED_WO_SYNC instead of ORTE_JOB_STATE_ABORTED.


Thanks,

Tim