[OMPI users] error messages for btl components that aren't loaded

2006-06-28 Thread Patrick Jessee
Hello. I'm getting some odd error messages in certain situations associated with the btl components (this happens with both 1.0.2 and 1.1). When certain btl components are NOT loaded, Open MPI issues error messages associated with those very components. For instance, consider an application that
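
As a rough illustration of the kind of setup being described (the exact command line is not in the message; the application name and BTL lists below are placeholders), which BTL components get loaded is normally controlled with the btl MCA parameter:

    # Load only the self and tcp BTLs; every other BTL should stay out of the picture.
    mpirun -np 4 --mca btl self,tcp ./my_app

    # Exclusion form: load everything except the openib BTL.
    mpirun -np 4 --mca btl ^openib ./my_app

The report above is that error messages tied to components that are not loaded still show up at run time.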

Re: [OMPI users] Re : OpenMPI 1.1: Signal:10, info.si_errno:0(Unknown, error: 0), si_code:1(BUS_ADRALN)

2006-06-28 Thread Eric Thibodeau
I am actually running the released 1.1. I can send you my code, if you want, and you could try running it off a single node with -np 4 or 5 (oversubscribing) and see if you get a BUS_ADRALN error off one node. The only requirement for compiling the code is that the X libs be available (display is not

Re: [OMPI users] Re : OpenMPI 1.1: Signal:10, info.si_errno:0(Unknown, error: 0), si_code:1(BUS_ADRALN)

2006-06-28 Thread Terry D. Dontje
Well, I've been using the trunk and not 1.1. I also just built 1.1.1a1r10538 and ran it with no bus error. Though you are running 1.1b5r10421, so we're not running the same thing yet. I have a cluster of two v440s that have 4 CPUs each, running Solaris 10. The tests I am running are np

Re: [OMPI users] Re : OpenMPI 1.1: Signal:10, info.si_errno:0(Unknown, error: 0 ), si_code:1(BUS_ADRALN)

2006-06-28 Thread Eric Thibodeau
Terry, I was about to comment on this. Could you tell me the specs of your machine? As you will notice in "my thread", I am running into problems on Sparc SMP systems where the CPU boards' RTCs are in a doubtful state. Are you running 1.1 on SMP machines? If so, on how many procs and wh

[OMPI users] Re : OpenMPI 1.1: Signal:10, info.si_errno:0(Unknown, error: 0), si_code:1(BUS_ADRALN)

2006-06-28 Thread Terry D. Dontje
Frank, can you set your limit coredumpsize to non-zero, rerun the program, and then get the stack via dbx? So, I have a similar case of BUS_ADRALN on SPARC systems with an older version (June 21st) of the trunk. I've since run using the latest trunk and the bus error went away. I am now going to try
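
Spelled out, the suggested debugging sequence looks roughly like this (csh/tcsh syntax to match the limit command above; the executable name and process count are placeholders):

    limit coredumpsize unlimited    # allow a core file to be written (csh/tcsh)
    mpirun -np 2 ./my_app           # rerun until the SIGBUS shows up again
    dbx ./my_app core               # open the executable together with the core file
    (dbx) where                     # print the stack trace at the point of the crash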

Re: [OMPI users] Openmpi 1.1: startup problem caused by file descriptor state

2006-06-28 Thread Patrick Jessee
Brian Barrett wrote: On Wed, 2006-06-28 at 09:43 -0400, Patrick Jessee wrote: Hello. I've tracked down the source of the previously reported startup problem with Openmpi 1.1. On startup, it fails with the messages: mca_oob_tcp_accept: accept() failed with errno 9. : This didn't happe

Re: [OMPI users] users Digest, Vol 317, Issue 4

2006-06-28 Thread Eric Thibodeau
The problem was resolved in the 1.1 series...so I didn't push any further. Thanks! On Wednesday, June 28, 2006 at 09:21, openmpi-user wrote: > Hi Eric (and all), > > don't know if this really messes things up, but you have set up lam-mpi > in your path-variables, too: > > [enterprise:24786] pls:

Re: [OMPI users] Openmpi 1.1: startup problem caused by file descriptor state

2006-06-28 Thread Brian Barrett
On Wed, 2006-06-28 at 09:43 -0400, Patrick Jessee wrote: > Hello. I've tracked down the source of the previously reported startup > problem with Openmpi 1.1. On startup, it fails with the messages: > > mca_oob_tcp_accept: accept() failed with errno 9. > : > > This didn't happen with 1.0.2.

Re: [OMPI users] Installing OpenMPI on a solaris

2006-06-28 Thread Eric Thibodeau
Yeah, bummer, but something tells me it might not be OpenMPI's fault. Here's why: 1- The tech who takes care of these machines told me that he gets RTC errors on bootup (the CPU boards are apparently "out of sync" since the clocks aren't set correctly). 2- There is also a possibility that the p

[OMPI users] Openmpi 1.1: startup problem caused by file descriptor state

2006-06-28 Thread Patrick Jessee
Hello. I've tracked down the source of the previously reported startup problem with Openmpi 1.1. On startup, it fails with the messages: mca_oob_tcp_accept: accept() failed with errno 9. : This didn't happen with 1.0.2. The trigger for this behavior is if standard input happens to be clos
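
For anyone trying to reproduce the condition described here, one rough way (Bourne-style shell assumed; the application name and process count are placeholders) is to launch mpirun with file descriptor 0 explicitly closed:

    # Close stdin (fd 0) for the mpirun process; per the report above, with 1.1 this
    # leads to "mca_oob_tcp_accept: accept() failed with errno 9" (errno 9 is EBADF).
    mpirun -np 2 ./my_app 0<&-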

Re: [OMPI users] users Digest, Vol 317, Issue 4

2006-06-28 Thread openmpi-user
Hi Eric (and all), don't know if this really messes things up, but you have set up lam-mpi in your path-variables, too: [enterprise:24786] pls:rsh: reset LD_LIBRARY_PATH: /export/lca/home/lca0/etudiants/ac38820/openmpi_sun4u/lib:/export/lca/appl/Forte/SUNWspro/WS6U2/lib:/usr/local/lib:*/usr/l
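
A quick way to check which MPI installation is actually being picked up on a node (generic commands; the expectation is that everything resolves into the openmpi_sun4u prefix shown in the LD_LIBRARY_PATH above rather than into a LAM/MPI install):

    which mpirun mpicc        # should point into the Open MPI prefix, not a LAM/MPI one
    echo $PATH
    echo $LD_LIBRARY_PATH     # the Open MPI lib directory should come before any LAM directories
    ompi_info | head          # confirms which Open MPI build actually answers on this node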

Re: [OMPI users] rsh/ssh is work but mpirun hang ?

2006-06-28 Thread Jeff Squyres (jsquyres)
This *may* be due to stdio blocking issues (e.g., not getting the password/passphrase to ssh properly, so the application never actually launches on the remote node). The first thing I would do is find out why you are getting prompted for a password. Open MPI requires that you are not prompte
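
One common way to get the password-less ssh logins that Open MPI expects for rsh/ssh startup is plain RSA key authentication; a minimal sketch, with node2 as a placeholder host name:

    ssh-keygen -t rsa                                   # accept an empty passphrase, or use ssh-agent
    cat ~/.ssh/id_rsa.pub | ssh node2 'cat >> ~/.ssh/authorized_keys'
    ssh node2 'chmod 600 ~/.ssh/authorized_keys'
    ssh node2 true                                      # should now return without prompting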

[OMPI users] FW: mpi_allreduce error

2006-06-28 Thread Jeff Squyres (jsquyres)
(this thread started as a LAM question [http://www.lam-mpi.org/MailArchives/lam/2006/06/12497.php], and one message contained an Open MPI question, so I took the liberty of moving it to the OMPI user's list) > As for openmpi, I get a lot of messages like this > > global_ssi(1441) malloc: *** Dea

Re: [OMPI users] rsh/ssh is work but mpirun hang ?

2006-06-28 Thread shen T.T.
If I mpirun the 'hello world' MPI application on a single computer (dual core) by itself, it works. But it doesn't succeed when I mpirun it across multiple nodes. The rsh/ssh agent works; I can rsh/ssh to other nodes. Every time I mpirun 'hostname', the remote rsh/ssh agent asks for the passw

Re: [OMPI users] Installing OpenMPI on a solaris

2006-06-28 Thread Jeff Squyres (jsquyres)
Bummer! :-( Just to be sure -- you had a clean config.cache file before you ran configure, right? (e.g., the file didn't exist -- just to be sure it didn't get potentially erroneous values from a previous run of configure) Also, FWIW, it's not necessary to specify --enable-ltdl-convenience;
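
To rule out stale cached configure results, something along these lines is usually enough before re-running configure (the install prefix is a placeholder):

    cd openmpi-1.1
    make distclean                    # harmless if the tree was never built
    rm -f config.cache                # drop any cached results from a previous configure run
    ./configure --prefix=/opt/openmpi-1.1
    make all install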

Re: [OMPI users] OpenMPI 1.1: Signal:10 info.si_errno:0(Unknown, error: 0), si_code:1(BUS_ADRALN) (Terry D. Dontje)

2006-06-28 Thread openmpi-user
Component v1.1) Enclosed you'll find the config.log. Yours, Frank [attachment: config.log, http://www.open-mpi.org/MailArchives/users/attachments/20060628/a640acf1/config.pl]

Re: [OMPI users] Fw: OpenMPI version 1.1

2006-06-28 Thread Jeff Squyres (jsquyres)
A common problem that I have seen is that all nodes in the cluster may not be configured identically. For example, can you confirm that eth1 is your gigE interface on all nodes? It might have accidentally been configured to be your IPoIB interface on some nodes. If that's not the case, let us k
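
A quick way to spot-check that eth1 really is the same kind of interface on every node (assumes a file named hosts listing the nodes and password-less ssh; the ifconfig path can differ between distributions):

    for h in `cat hosts`; do
        echo "== $h =="
        ssh $h '/sbin/ifconfig eth1 | head -2'   # compare the address and interface type across nodes
    done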

Re: [OMPI users] rsh/ssh is work but mpirun hang ?

2006-06-28 Thread Jeff Squyres (jsquyres)
Can you provide a little more information? What exactly are you trying to mpirun across multiple nodes? Is it an MPI application or a non-MPI application? For example, can you mpirun "hostname" (i.e., the Unix hostname utility) across multiple nodes successfully? If you're trying to mpirun
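
The suggested non-MPI sanity check, spelled out (the hostfile name and slot count are placeholders):

    # Every node listed in myhosts should print its own name; if this hangs or
    # prompts for a password, the problem is in the rsh/ssh startup, not in MPI itself.
    mpirun -np 4 --hostfile myhosts hostname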

Re: [OMPI users] OpenMPI 1.1: Signal:10 info.si_errno:0(Unknown error: 0), si_code:1(BUS_ADRALN) (Frank)

2006-06-28 Thread Frank
, API v1.0, Component v1.1) Enclosed you'll find the config.log. Yours, Frank [attachment: config.log, http://www.open-mpi.org/MailArchives/users/attachments/20060628/a640acf1/config.pl]

Re: [OMPI users] OpenMPI 1.1: Signal:10 info.si_errno:0(Unknown, error: 0) si_code:1(BUS_ADRALN)

2006-06-28 Thread Terry D. Dontje
[attachment: config.log, http://www.open-mpi.org/MailArchives/users/attachments/20060628/a640acf1/config.pl]

[OMPI users] OpenMPI 1.1: Signal:10 info.si_errno:0(Unknown error: 0) si_code:1(BUS_ADRALN)

2006-06-28 Thread Frank
Hi! I've recently updated to OpenMPI 1.1 on a few nodes and am running into a problem that wasn't there with OpenMPI 1.0.2. Submitting a job to the XGrid with OpenMPI 1.1 yields a Bus error that isn't there when the job is not submitted to the XGrid: [g5dual:/Network/CFD/MVH-1.0] motte% mpirun -
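
One way to confirm that the crash is tied to the XGrid launcher rather than to the application would be to force a different launcher for the same run; a rough sketch assuming 1.1's process launch (pls) framework, with a placeholder application name:

    # Exclude the xgrid launcher component so mpirun falls back to another pls (e.g. rsh):
    mpirun --mca pls ^xgrid -np 2 ./my_app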