[OMPI users] orte_pls_base_select fails

2007-07-18 Thread Adam C Powell IV
Greetings,

I'm running the Debian package of OpenMPI in a chroot (with /proc
mounted properly), and orte_init is failing as follows:

$ uptime
 12:51:55 up 12 days, 21:30,  0 users,  load average: 0.00, 0.00, 0.00
$ orterun -np 1 uptime
[new-host-3:18250] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_init_stage1.c at line 312
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_pls_base_select failed
  --> Returned value -1 instead of ORTE_SUCCESS

--
[new-host-3:18250] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_system_init.c at line 42
[new-host-3:18250] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at 
line 52
--
Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -1 instead of ORTE_SUCCESS.
--

Note running with -v produces no more output than this.  Running orted
in the background doesn't seem to help.

What could be wrong?  Does orterun not run in a chroot environment?
What more can I do to investigate further?

Thanks,
-Adam
-- 
GPG fingerprint: D54D 1AEE B11C CE9B A02B  C5DD 526F 01E8 564E E4B6

Welcome to the best software in the world today cafe!
http://www.take6.com/albums/greatesthits.html



Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Tim Prins

Adam C Powell IV wrote:

Greetings,

I'm running the Debian package of OpenMPI in a chroot (with /proc
mounted properly), and orte_init is failing as follows:

$ uptime
 12:51:55 up 12 days, 21:30,  0 users,  load average: 0.00, 0.00, 0.00
$ orterun -np 1 uptime
[new-host-3:18250] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_init_stage1.c at line 312
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_pls_base_select failed
  --> Returned value -1 instead of ORTE_SUCCESS

--
[new-host-3:18250] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_system_init.c at line 42
[new-host-3:18250] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at 
line 52
--
Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -1 instead of ORTE_SUCCESS.
--

Note running with -v produces no more output than this.  Running orted
in the background doesn't seem to help.

What could be wrong?  Does orterun not run in a chroot environment?
What more can I do to investigate further?

Try running mpirun with the added options:
-mca orte_debug 1 -mca pls_base_verbose 20

Then send the output to the list.

Thanks,

Tim



Thanks,
-Adam




Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Adam C Powell IV
On Wed, 2007-07-18 at 09:50 -0400, Tim Prins wrote:
> Adam C Powell IV wrote:
> > Greetings,
> > 
> > I'm running the Debian package of OpenMPI in a chroot (with /proc
> > mounted properly), and orte_init is failing as follows:
> > [snip]
> > What could be wrong?  Does orterun not run in a chroot environment?
> > What more can I do to investigate further?
> Try running mpirun with the added options:
> -mca orte_debug 1 -mca pls_base_verbose 20
> 
> Then send the output to the list.

Thanks!  Here's the output:

$ orterun -mca orte_debug 1 -mca pls_base_verbose 20 -np 1 uptime
[new-host-3:19201] mca: base: components_open: Looking for pls components
[new-host-3:19201] mca: base: components_open: distilling pls components
[new-host-3:19201] mca: base: components_open: accepting all pls components
[new-host-3:19201] mca: base: components_open: opening pls components
[new-host-3:19201] mca: base: components_open: found loaded component gridengine
[new-host-3:19201] mca: base: components_open: component gridengine open function successful
[new-host-3:19201] mca: base: components_open: found loaded component proxy
[new-host-3:19201] mca: base: components_open: component proxy open function 
successful
[new-host-3:19201] mca: base: components_open: found loaded component rsh
[new-host-3:19201] mca: base: components_open: component rsh open function 
successful
[new-host-3:19201] mca: base: components_open: found loaded component slurm
[new-host-3:19201] mca: base: components_open: component slurm open function 
successful
[new-host-3:19201] orte:base:select: querying component gridengine
[new-host-3:19201] pls:gridengine: NOT available for selection
[new-host-3:19201] orte:base:select: querying component proxy
[new-host-3:19201] orte:base:select: querying component rsh
[new-host-3:19201] orte:base:select: querying component slurm
[new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_init_stage1.c at line 312
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_pls_base_select failed
  --> Returned value -1 instead of ORTE_SUCCESS

--
[new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_system_init.c at line 42
[new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at 
line 52
--
Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -1 instead of ORTE_SUCCESS.
--

-Adam



Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Tim Prins
This is strange. I assume that you want to use rsh or ssh to launch the 
processes?


If you want to use ssh, does "which ssh" find ssh? Similarly, if you 
want to use rsh, does "which rsh" find rsh?


Thanks,

Tim

Adam C Powell IV wrote:

On Wed, 2007-07-18 at 09:50 -0400, Tim Prins wrote:

Adam C Powell IV wrote:

Greetings,

I'm running the Debian package of OpenMPI in a chroot (with /proc
mounted properly), and orte_init is failing as follows:
[snip]
What could be wrong?  Does orterun not run in a chroot environment?
What more can I do to investigate further?

Try running mpirun with the added options:
-mca orte_debug 1 -mca pls_base_verbose 20

Then send the output to the list.


Thanks!  Here's the output:

$ orterun -mca orte_debug 1 -mca pls_base_verbose 20 -np 1 uptime
[new-host-3:19201] mca: base: components_open: Looking for pls components
[new-host-3:19201] mca: base: components_open: distilling pls components
[new-host-3:19201] mca: base: components_open: accepting all pls components
[new-host-3:19201] mca: base: components_open: opening pls components
[new-host-3:19201] mca: base: components_open: found loaded component gridengine
[new-host-3:19201] mca: base: components_open: component gridengine open function successful
[new-host-3:19201] mca: base: components_open: found loaded component proxy
[new-host-3:19201] mca: base: components_open: component proxy open function 
successful
[new-host-3:19201] mca: base: components_open: found loaded component rsh
[new-host-3:19201] mca: base: components_open: component rsh open function 
successful
[new-host-3:19201] mca: base: components_open: found loaded component slurm
[new-host-3:19201] mca: base: components_open: component slurm open function 
successful
[new-host-3:19201] orte:base:select: querying component gridengine
[new-host-3:19201] pls:gridengine: NOT available for selection
[new-host-3:19201] orte:base:select: querying component proxy
[new-host-3:19201] orte:base:select: querying component rsh
[new-host-3:19201] orte:base:select: querying component slurm
[new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_init_stage1.c at line 312
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_pls_base_select failed
  --> Returned value -1 instead of ORTE_SUCCESS

--
[new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_system_init.c at line 42
[new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at 
line 52
--
Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -1 instead of ORTE_SUCCESS.
--

-Adam




[OMPI users] Octave MPITB for Open-MPI

2007-07-18 Thread Javier Fernández


MPITB is an Octave toolbox for MPI. The new release also works with 
Open-MPI. Test reports are welcome, since this is an initial release. 
For more information see http://atc.ugr.es/javier-bin/mpitb


Thanks

-javier


Re: [OMPI users] DataTypes with "holes" for writing files

2007-07-18 Thread Robert Latham
On Tue, Jul 10, 2007 at 04:36:01PM +, jody wrote:
> I think there is still some problem.
> I create different datatypes by resizing MPI_SHORT with
> different negative lower bounds (depending on the rank)
> and the same extent (only depending on the number of processes).
> 
> However, I get an error as soon as  MPI_File_set_view is called with my new
> datatype:
> 
> Error: Unsupported datatype passed to ADIOI_Count_contiguous_blocks
> [aim-nano_02:9] MPI_ABORT invoked on rank 0 in communicator
> MPI_COMM_WORLD with errorcode 1

Hi Jody

I was wrong about this being a problem with OpenMPI's version of
ROMIO.  The OpenMPI guys have synced up fairly recently with the
ROMIO in MPICH2.

ROMIO, even the very latest CVS version, doesn't support resized
types yet.

Looks like you'll have to take George's alternate idea of MPI_UB and
MPI_LB. 

We'll let the OpenMPI guys know when resized support is in place.

Sorry for the confusion.
==rob

-- 
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B


Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Adam C Powell IV
As mentioned, I'm running in a chroot environment, so rsh and ssh won't
work: "rsh localhost" will rsh into the primary local host environment,
not the chroot, which will fail.

[The purpose is to be able to build and test MPI programs in the Debian
unstable distribution, without upgrading the whole machine to unstable.
Though most machines I use for this purpose run Debian stable or
testing, the machine I'm currently using runs a very old Fedora, for
which I don't think OpenMPI is available.]

With MPICH, mpirun -np 1 just runs the new process in the current
context, without rsh/ssh, so it works in a chroot.  Does OpenMPI not
support this functionality?

Thanks,
Adam

On Wed, 2007-07-18 at 11:09 -0400, Tim Prins wrote:
> This is strange. I assume that you want to use rsh or ssh to launch the 
> processes?
> 
> If you want to use ssh, does "which ssh" find ssh? Similarly, if you 
> want to use rsh, does "which rsh" find rsh?
> 
> Thanks,
> 
> Tim
> 
> Adam C Powell IV wrote:
> > On Wed, 2007-07-18 at 09:50 -0400, Tim Prins wrote:
> >> Adam C Powell IV wrote:
> >>> Greetings,
> >>>
> >>> I'm running the Debian package of OpenMPI in a chroot (with /proc
> >>> mounted properly), and orte_init is failing as follows:
> >>> [snip]
> >>> What could be wrong?  Does orterun not run in a chroot environment?
> >>> What more can I do to investigate further?
> >> Try running mpirun with the added options:
> >> -mca orte_debug 1 -mca pls_base_verbose 20
> >>
> >> Then send the output to the list.
> > 
> > Thanks!  Here's the output:
> > 
> > $ orterun -mca orte_debug 1 -mca pls_base_verbose 20 -np 1 uptime
> > [new-host-3:19201] mca: base: components_open: Looking for pls components
> > [new-host-3:19201] mca: base: components_open: distilling pls components
> > [new-host-3:19201] mca: base: components_open: accepting all pls components
> > [new-host-3:19201] mca: base: components_open: opening pls components
> > [new-host-3:19201] mca: base: components_open: found loaded component gridengine
> > [new-host-3:19201] mca: base: components_open: component gridengine open function successful
> > [new-host-3:19201] mca: base: components_open: found loaded component proxy
> > [new-host-3:19201] mca: base: components_open: component proxy open 
> > function successful
> > [new-host-3:19201] mca: base: components_open: found loaded component rsh
> > [new-host-3:19201] mca: base: components_open: component rsh open function 
> > successful
> > [new-host-3:19201] mca: base: components_open: found loaded component slurm
> > [new-host-3:19201] mca: base: components_open: component slurm open 
> > function successful
> > [new-host-3:19201] orte:base:select: querying component gridengine
> > [new-host-3:19201] pls:gridengine: NOT available for selection
> > [new-host-3:19201] orte:base:select: querying component proxy
> > [new-host-3:19201] orte:base:select: querying component rsh
> > [new-host-3:19201] orte:base:select: querying component slurm
> > [new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file 
> > runtime/orte_init_stage1.c at line 312
> > --
> > It looks like orte_init failed for some reason; your parallel process is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during orte_init; some of which are due to configuration or
> > environment problems.  This failure appears to be an internal failure;
> > here's some additional information (which may only be relevant to an
> > Open MPI developer):
> > 
> >   orte_pls_base_select failed
> >   --> Returned value -1 instead of ORTE_SUCCESS
> > 
> > --
> > [new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file 
> > runtime/orte_system_init.c at line 42
> > [new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file 
> > runtime/orte_init.c at line 52
> > --
> > Open RTE was unable to initialize properly.  The error occured while
> > attempting to orte_init().  Returned value -1 instead of ORTE_SUCCESS.
> > --
> > 
> > -Adam




Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Ralph H Castain



On 7/18/07 9:49 AM, "Adam C Powell IV"  wrote:

> As mentioned, I'm running in a chroot environment, so rsh and ssh won't
> work: "rsh localhost" will rsh into the primary local host environment,
> not the chroot, which will fail.
> 
> [The purpose is to be able to build and test MPI programs in the Debian
> unstable distribution, without upgrading the whole machine to unstable.
> Though most machines I use for this purpose run Debian stable or
> testing, the machine I'm currently using runs a very old Fedora, for
> which I don't think OpenMPI is available.]
> 
> With MPICH, mpirun -np 1 just runs the new process in the current
> context, without rsh/ssh, so it works in a chroot.  Does OpenMPI not
> support this functionality?

Yes - and no. OpenMPI will launch on a local node without using rsh/ssh.
However, and it is a big however, our init code requires that we still
identify a working launcher that could be used to launch on remote nodes.
Frankly, we never considered the case you describe.

We could (and perhaps should) modify the code to allow it to continue even
if it doesn't find a viable launcher. I believe our initial thinking was
that something that launched only on the local node wasn't much use to MPI
and therefore that scenario probably represents an error condition.

We'll discuss it and see what we think should be done. Meantime, the answer
would have to be "no, we don't support that."

Ralph

> 
> Thanks,
> Adam
> 
> On Wed, 2007-07-18 at 11:09 -0400, Tim Prins wrote:
>> This is strange. I assume that you want to use rsh or ssh to launch the
>> processes?
>> 
>> If you want to use ssh, does "which ssh" find ssh? Similarly, if you
>> want to use rsh, does "which rsh" find rsh?
>> 
>> Thanks,
>> 
>> Tim
>> 
>> Adam C Powell IV wrote:
>>> On Wed, 2007-07-18 at 09:50 -0400, Tim Prins wrote:
 Adam C Powell IV wrote:
> Greetings,
> 
> I'm running the Debian package of OpenMPI in a chroot (with /proc
> mounted properly), and orte_init is failing as follows:
> [snip]
> What could be wrong?  Does orterun not run in a chroot environment?
> What more can I do to investigate further?
 Try running mpirun with the added options:
 -mca orte_debug 1 -mca pls_base_verbose 20
 
 Then send the output to the list.
>>> 
>>> Thanks!  Here's the output:
>>> 
>>> $ orterun -mca orte_debug 1 -mca pls_base_verbose 20 -np 1 uptime
>>> [new-host-3:19201] mca: base: components_open: Looking for pls components
>>> [new-host-3:19201] mca: base: components_open: distilling pls components
>>> [new-host-3:19201] mca: base: components_open: accepting all pls components
>>> [new-host-3:19201] mca: base: components_open: opening pls components
>>> [new-host-3:19201] mca: base: components_open: found loaded component gridengine
>>> [new-host-3:19201] mca: base: components_open: component gridengine open function successful
>>> [new-host-3:19201] mca: base: components_open: found loaded component proxy
>>> [new-host-3:19201] mca: base: components_open: component proxy open function
>>> successful
>>> [new-host-3:19201] mca: base: components_open: found loaded component rsh
>>> [new-host-3:19201] mca: base: components_open: component rsh open function
>>> successful
>>> [new-host-3:19201] mca: base: components_open: found loaded component slurm
>>> [new-host-3:19201] mca: base: components_open: component slurm open function
>>> successful
>>> [new-host-3:19201] orte:base:select: querying component gridengine
>>> [new-host-3:19201] pls:gridengine: NOT available for selection
>>> [new-host-3:19201] orte:base:select: querying component proxy
>>> [new-host-3:19201] orte:base:select: querying component rsh
>>> [new-host-3:19201] orte:base:select: querying component slurm
>>> [new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file
>>> runtime/orte_init_stage1.c at line 312
>>> --
>>> It looks like orte_init failed for some reason; your parallel process is
>>> likely to abort.  There are many reasons that a parallel process can
>>> fail during orte_init; some of which are due to configuration or
>>> environment problems.  This failure appears to be an internal failure;
>>> here's some additional information (which may only be relevant to an
>>> Open MPI developer):
>>> 
>>>   orte_pls_base_select failed
>>>   --> Returned value -1 instead of ORTE_SUCCESS
>>> 
>>> --
>>> [new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file
>>> runtime/orte_system_init.c at line 42
>>> [new-host-3:19201] [0,0,0] ORTE_ERROR_LOG: Error in file runtime/orte_init.c
>>> at line 52
>>> --
>>> Open RTE was unable to initialize properly.  The error occured while
>>> attempting to orte_init().  Returned value -1 instead of ORTE_SUCCESS.
>>> 

Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Tim Prins

Adam C Powell IV wrote:

As mentioned, I'm running in a chroot environment, so rsh and ssh won't
work: "rsh localhost" will rsh into the primary local host environment,
not the chroot, which will fail.

[The purpose is to be able to build and test MPI programs in the Debian
unstable distribution, without upgrading the whole machine to unstable.
Though most machines I use for this purpose run Debian stable or
testing, the machine I'm currently using runs a very old Fedora, for
which I don't think OpenMPI is available.]


Alright, I understand what you are trying to do now. To be honest, I 
don't think we have ever really thought about this use case. We always 
figured that to test Open MPI people would simply install it in a 
different directory and use it from there.




With MPICH, mpirun -np 1 just runs the new process in the current
context, without rsh/ssh, so it works in a chroot.  Does OpenMPI not
support this functionality?


Open MPI does support this functionality. First, a bit of explanation:

We use 'pls' (process launching system) components to handle the 
launching of processes. There are components for slurm, gridengine, rsh, 
and others. At runtime we open each of these components and query them 
as to whether they can be used. The original error you posted says that 
none of the 'pls' components can be used because they all detected that 
they could not run in your setup. The slurm one excluded itself because 
there were no environment variables set indicating it is running under 
SLURM. Similarly, the gridengine pls said it cannot run. The 'rsh' pls 
said it cannot run because neither 'ssh' nor 'rsh' is available (I 
assume this is the case, though you did not explicitly say they were 
not available).


But in this case, you do want the 'rsh' pls to be used. It will 
automatically fork any local processes, and will use rsh/ssh to launch 
any remote processes. Again, I don't think we ever imagined the use case 
of a UNIX-like system where there are no launchers like SLURM available 
and rsh/ssh also aren't available (Open MPI is, after all, primarily 
concerned with multi-node operation).


So, there are several ways around this:

1. Make rsh or ssh available, even though they will not be used.

2. Tell the 'rsh' pls component to use a dummy program such as 
/bin/false by adding the following to the command line:

-mca pls_rsh_agent /bin/false

3. Create a dummy 'rsh' executable that is available in your path.

For instance:

[tprins@odin ~]$ which ssh
/usr/bin/which: no ssh in 
(/u/tprins/usr/ompia/bin:/u/tprins/usr/bin:/usr/local/bin:/bin:/usr/X11R6/bin)

[tprins@odin ~]$ which rsh
/usr/bin/which: no rsh in 
(/u/tprins/usr/ompia/bin:/u/tprins/usr/bin:/usr/local/bin:/bin:/usr/X11R6/bin)

[tprins@odin ~]$ mpirun -np 1  hostname
[odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_init_stage1.c at line 317

--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_pls_base_select failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS

--
[odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_system_init.c at line 46
[odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file 
runtime/orte_init.c at line 52
[odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file 
orterun.c at line 399


[tprins@odin ~]$ mpirun -np 1 -mca pls_rsh_agent /bin/false  hostname
odin.cs.indiana.edu

[tprins@odin ~]$ touch usr/bin/rsh
[tprins@odin ~]$ chmod +x usr/bin/rsh
[tprins@odin ~]$ mpirun -np 1  hostname
odin.cs.indiana.edu
[tprins@odin ~]$
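In a chroot where installing a real rsh client is awkward, Tim's third workaround can be scripted at setup time. The sketch below is illustrative, not part of the original thread: it drops a do-nothing `rsh` stub into a temporary directory and prepends that to `PATH`, which is enough for the rsh pls component's "is rsh available?" check to pass.

```shell
# Minimal sketch of workaround 3: a do-nothing "rsh" stub on the PATH,
# just enough for the rsh pls component's availability check to pass.
# The stub directory is illustrative; any directory on your PATH works.
stubdir=$(mktemp -d)
printf '#!/bin/sh\nexit 0\n' > "$stubdir/rsh"
chmod +x "$stubdir/rsh"
PATH="$stubdir:$PATH"
export PATH
command -v rsh    # now resolves to the stub
```

Because mpirun only forks local processes in this scenario, the stub is never actually executed to reach a remote node.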


I hope this helps,

Tim



Thanks,
Adam

On Wed, 2007-07-18 at 11:09 -0400, Tim Prins wrote:
This is strange. I assume that you what to use rsh or ssh to launch the 
processes?


If you want to use ssh, does "which ssh" find ssh? Similarly, if you 
want to use rsh, does "which rsh" find rsh?


Thanks,

Tim

Adam C Powell IV wrote:

On Wed, 2007-07-18 at 09:50 -0400, Tim Prins wrote:

Adam C Powell IV wrote:

Greetings,

I'm running the Debian package of OpenMPI in a chroot (with /proc
mounted properly), and orte_init is failing as follows:
[snip]
What could be wrong?  Does orterun not run in a chroot environment?
What more can I do to investigate further?

Try running mpirun with the added options:
-mca orte_debug 1 -mca pls_base_verbose 20

Then send the output to the list.

Thanks!  Here's the output:

$ orterun -mca orte_debug 1 -mca pls_base_verbose 20 -np 1 uptime
[new-host-3:19201] mca: base: components_open: Looking for 

Re: [OMPI users] mpirun hanging followup

2007-07-18 Thread Bill Johnstone
--- Ralph Castain  wrote:

> No, the session directory is created in the tmpdir - we don't create
> anything anywhere else, nor do we write any executables anywhere.

In the case where the TMPDIR env variable isn't specified, what is the
default assumed by Open MPI/orte?

> Just out of curiosity: although I know you have different arch's on
> your
> nodes, the tests you are running are all executing on the same arch,
> correct???

Yes, tests all execute on the same arch, although I am led to another
question.  Can I use a headnode of a particular arch, but in my mpirun
hostfile, specify only nodes of another arch, and launch from the
headnode?  In other words, no computation is done on the headnode of
arch A, all computation is done on nodes of arch B, but the job is
launched from the headnode -- would that be acceptable?

I should be clear that for the problem you are helping me with, *all*
the nodes involved are running the same arch, OS, compiler, system
libraries, etc.  The multiple arch question is for edification for the
future.







Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Ralph H Castain
Tim has proposed a clever fix that I had not thought of - just be aware that
it could cause unexpected behavior at some point. Still, for what you are
trying to do, that might meet your needs.

Ralph


On 7/18/07 11:44 AM, "Tim Prins"  wrote:

> Adam C Powell IV wrote:
>> As mentioned, I'm running in a chroot environment, so rsh and ssh won't
>> work: "rsh localhost" will rsh into the primary local host environment,
>> not the chroot, which will fail.
>> 
>> [The purpose is to be able to build and test MPI programs in the Debian
>> unstable distribution, without upgrading the whole machine to unstable.
>> Though most machines I use for this purpose run Debian stable or
>> testing, the machine I'm currently using runs a very old Fedora, for
>> which I don't think OpenMPI is available.]
> 
> Alright, I understand what you are trying to do now. To be honest, I
> don't think we have ever really thought about this use case. We always
> figured that to test Open MPI people would simply install it in a
> different directory and use it from there.
> 
>> 
>> With MPICH, mpirun -np 1 just runs the new process in the current
>> context, without rsh/ssh, so it works in a chroot.  Does OpenMPI not
>> support this functionality?
> 
> Open MPI does support this functionality. First, a bit of explanation:
> 
> We use 'pls' (process launching system) components to handle the
> launching of processes. There are components for slurm, gridengine, rsh,
> and others. At runtime we open each of these components and query them
> as to whether they can be used. The original error you posted says that
> none of the 'pls' components can be used because they all detected that
> they could not run in your setup. The slurm one excluded itself because
> there were no environment variables set indicating it is running under
> SLURM. Similarly, the gridengine pls said it cannot run. The 'rsh' pls
> said it cannot run because neither 'ssh' nor 'rsh' is available (I
> assume this is the case, though you did not explicitly say
> they were not available).
> 
> But in this case, you do want the 'rsh' pls to be used. It will
> automatically fork any local processes, and will use rsh/ssh to launch
> any remote processes. Again, I don't think we ever imagined the use case
> of a UNIX-like system where there are no launchers like SLURM
> available and rsh/ssh also aren't available (Open MPI is, after all,
> primarily concerned with multi-node operation).
> 
> So, there are several ways around this:
> 
> 1. Make rsh or ssh available, even though they will not be used.
> 
> 2. Tell the 'rsh' pls component to use a dummy program such as
> /bin/false by adding the following to the command line:
> -mca pls_rsh_agent /bin/false
> 
> 3. Create a dummy 'rsh' executable that is available in your path.
> 
> For instance:
> 
> [tprins@odin ~]$ which ssh
> /usr/bin/which: no ssh in
> (/u/tprins/usr/ompia/bin:/u/tprins/usr/bin:/usr/local/bin:/bin:/usr/X11R6/bin)
> [tprins@odin ~]$ which rsh
> /usr/bin/which: no rsh in
> (/u/tprins/usr/ompia/bin:/u/tprins/usr/bin:/usr/local/bin:/bin:/usr/X11R6/bin)
> [tprins@odin ~]$ mpirun -np 1  hostname
> [odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file
> runtime/orte_init_stage1.c at line 317
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>orte_pls_base_select failed
>--> Returned value Error (-1) instead of ORTE_SUCCESS
> 
> --
> [odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file
> runtime/orte_system_init.c at line 46
> [odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file
> runtime/orte_init.c at line 52
> [odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file
> orterun.c at line 399
> 
> [tprins@odin ~]$ mpirun -np 1 -mca pls_rsh_agent /bin/false  hostname
> odin.cs.indiana.edu
> 
> [tprins@odin ~]$ touch usr/bin/rsh
> [tprins@odin ~]$ chmod +x usr/bin/rsh
> [tprins@odin ~]$ mpirun -np 1  hostname
> odin.cs.indiana.edu
> [tprins@odin ~]$
> 
> 
> I hope this helps,
> 
> Tim
> 
>> 
>> Thanks,
>> Adam
>> 
>> On Wed, 2007-07-18 at 11:09 -0400, Tim Prins wrote:
>>> This is strange. I assume that you want to use rsh or ssh to launch the
>>> processes?
>>> 
>>> If you want to use ssh, does "which ssh" find ssh? Similarly, if you
>>> want to use rsh, does "which rsh" find rsh?
>>> 
>>> Thanks,
>>> 
>>> Tim
>>> 
>>> Adam C Powell IV wrote:
 On Wed, 2007-07-18 at 09:50 -0400, Tim Prins wrote:
> Adam C Powell IV wrote:
>> Greetings,
>> 
>> I'm running the Debi

Re: [OMPI users] mpirun hanging followup

2007-07-18 Thread Ralph H Castain



On 7/18/07 11:46 AM, "Bill Johnstone"  wrote:

> --- Ralph Castain  wrote:
> 
>> No, the session directory is created in the tmpdir - we don't create
>> anything anywhere else, nor do we write any executables anywhere.
> 
> In the case where the TMPDIR env variable isn't specified, what is the
> default assumed by Open MPI/orte?

It rattles through a logic chain:

1. ompi mca param value

2. TMPDIR in environ

3. TMP in environ

4. default to /tmp just so we have something to work with...
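The fallback chain above can be sketched in plain shell. This is an illustration of the order Ralph describes, not Open MPI's actual code; the first argument stands in for the MCA parameter value, since the exact parameter name isn't shown in this thread.

```shell
# Hedged sketch of the tmpdir fallback order: MCA param value,
# then $TMPDIR, then $TMP, then /tmp as a last resort.
resolve_tmpdir() {
    mca_value="$1"    # stand-in for the MCA parameter value
    if [ -n "$mca_value" ]; then printf '%s\n' "$mca_value"
    elif [ -n "$TMPDIR" ]; then printf '%s\n' "$TMPDIR"
    elif [ -n "$TMP" ]; then printf '%s\n' "$TMP"
    else printf '/tmp\n'
    fi
}

TMPDIR=/scratch
TMP=/var/tmp
resolve_tmpdir ''            # TMPDIR wins when no MCA value is set
resolve_tmpdir /opt/mytmp    # an explicit MCA value wins over everything
```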

> 
>> Just out of curiosity: although I know you have different arch's on
>> your
>> nodes, the tests you are running are all executing on the same arch,
>> correct???
> 
> Yes, tests all execute on the same arch, although I am led to another
> question.  Can I use a headnode of a particular arch, but in my mpirun
> hostfile, specify only nodes of another arch, and launch from the
> headnode?  In other words, no computation is done on the headnode of
> arch A, all computation is done on nodes of arch B, but the job is
> launched from the headnode -- would that be acceptable?

As long as the prefix is set such that the correct binary executables can be
found, you should be fine.
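One hedged way to keep the per-arch binaries straight on a mixed cluster is a small wrapper that derives the install prefix from the machine architecture and passes it to mpirun's `--prefix` flag. The `/opt/openmpi-*` paths below are hypothetical, and this is a sketch rather than a recommended layout:

```shell
# Hypothetical per-arch prefix selection; the /opt/openmpi-* install paths
# are invented for illustration. mpirun's --prefix flag then points each
# node at the Open MPI build for its own architecture.
arch=$(uname -m)
case "$arch" in
    x86_64)  OMPI_PREFIX=/opt/openmpi-x86_64  ;;
    ppc*)    OMPI_PREFIX=/opt/openmpi-ppc     ;;
    *)       OMPI_PREFIX=/opt/openmpi-generic ;;
esac
echo "$OMPI_PREFIX"
# e.g.: mpirun --prefix "$OMPI_PREFIX" -np 4 --hostfile nodes ./app
```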

> 
> I should be clear that for the problem you are helping me with, *all*
> the nodes involved are running the same arch, OS, compiler, system
> libraries, etc.  The multiple arch question is for edification for the
> future.

No problem - I just wanted to eliminate one possible complication for now.

Thanks
Ralph

> 
> 
> 
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] mpirun hanging followup

2007-07-18 Thread Bill Johnstone

--- Ralph Castain  wrote:

> Unfortunately, we don't have more debug statements internal to that
> function. I'll have to create a patch for you that will add some so
> we can
> better understand why it is failing - will try to send it to you on
> Wed.

Thank you for the patch you sent.

I solved the problem.  It was a head-slapper of an error.  Turned out
that I had forgotten -- the permissions on the filesystem override the
permissions of the mount point.  As I mentioned, these machines have an
NFS root filesystem.  In that filesystem, tmp has permissions 1777. 
However, when each node mounts its local temp partition to /tmp, the
permissions on that filesystem are the permissions the mount point
takes on.

In this case, I had forgotten to apply permissions 1777 to /tmp after
mounting on each machine.  As a result, /tmp really did not have the
appropriate permissions for mpirun to write to it as necessary.
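Bill's fix can be reproduced in miniature without NFS or a real mount: the mode lives in the mounted filesystem, so `1777` (world-writable plus sticky bit) must be reapplied after mounting. A sketch using a scratch directory in place of `/tmp`:

```shell
# Demonstrate the 1777 (drwxrwxrwt) mode /tmp needs, applied to a scratch
# directory standing in for a freshly mounted /tmp partition.
scratch=$(mktemp -d)
chmod 755 "$scratch"       # the state the newly mounted filesystem was in
stat -c '%a' "$scratch"    # prints 755
chmod 1777 "$scratch"      # the step that had been forgotten after mounting
stat -c '%a' "$scratch"    # prints 1777
rmdir "$scratch"
```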

Your patch helped me figure this out.  Technically, I should have been
able to figure it out from the messages you'd already sent to the
mailing list, but it wasn't until I saw the line in session_dir.c where
the error was occurring that I realized it had to be some kind of
permissions error.

I've attached the new debug output below:

[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
util/session_dir.c at line 108
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
util/session_dir.c at line 391
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_init_stage1.c at line 626
--
It looks like orte_init failed for some reason; your parallel process
is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value -1 instead of ORTE_SUCCESS

--
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_system_init.c at line 42
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 52
Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -1 instead of ORTE_SUCCESS.

Starting at line 108 of session_dir.c, is:

if (ORTE_SUCCESS != (ret = opal_os_dirpath_create(directory, my_mode)))
{
ORTE_ERROR_LOG(ret);
}

Three further points:

-Is there some reason ORTE can't bail out gracefully upon this error,
instead of hanging like it was doing for me?

-I think leaving in the extra debug logging code you sent me in the
patch for future Open MPI versions would be a good idea to help
troubleshoot problems like this.

-It would be nice to see "--debug-daemons" added to the Troubleshooting
section of the FAQ on the web site.

Thank you very very much for your help Ralph and everyone else that replied.






Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Adam C Powell IV
On Wed, 2007-07-18 at 13:44 -0400, Tim Prins wrote:
> Adam C Powell IV wrote:
> > As mentioned, I'm running in a chroot environment, so rsh and ssh won't
> > work: "rsh localhost" will rsh into the primary local host environment,
> > not the chroot, which will fail.
> > 
> > [The purpose is to be able to build and test MPI programs in the Debian
> > unstable distribution, without upgrading the whole machine to unstable.
> > Though most machines I use for this purpose run Debian stable or
> > testing, the machine I'm currently using runs a very old Fedora, for
> > which I don't think OpenMPI is available.]
> 
> Alright, I understand what you are trying to do now. To be honest, I 
> don't think we have ever really thought about this use case. We always 
> figured that to test Open MPI people would simply install it in a 
> different directory and use it from there.
> 
> > With MPICH, mpirun -np 1 just runs the new process in the current
> > context, without rsh/ssh, so it works in a chroot.  Does OpenMPI not
> > support this functionality?
> 
> Open MPI does support this functionality. First, a bit of explanation:
> 
> We use 'pls' (process launching system) components to handle the 
> launching of processes. There are components for slurm, gridengine, rsh, 
> and others. At runtime we open each of these components and query them 
> as to whether they can be used. The original error you posted says that 
> none of the 'pls' components can be used because they all detected that 
> they could not run in your setup. The slurm one excluded itself because 
> there were no environment variables set indicating it is running under 
> SLURM. Similarly, the gridengine pls said it cannot run as well. The 
> 'rsh' pls said it cannot run because neither 'ssh' nor 'rsh' are 
> available (I assume this is the case, though you did not explicitly say 
> they were not available).
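A rough shell caricature of the selection pass Tim describes may help. The real queries are C component code inside Open MPI; the environment variable names below (`SLURM_JOBID`, `SGE_ROOT`) are the usual markers for those launchers but should be treated as illustrative:

```shell
# Caricature of 'pls' component selection: each launcher excludes itself
# when its environment is absent; rsh is usable only if an rsh/ssh agent
# can be found on the PATH.
select_pls() {
    if [ -n "$SLURM_JOBID" ]; then
        echo slurm
    elif [ -n "$SGE_ROOT" ]; then
        echo gridengine
    elif command -v ssh >/dev/null 2>&1 || command -v rsh >/dev/null 2>&1; then
        echo rsh
    else
        echo "no usable pls component" >&2
        return 1
    fi
}

unset SLURM_JOBID SGE_ROOT
select_pls || echo "this is where orte_pls_base_select would fail"
```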
> 
> But in this case, you do want the 'rsh' pls to be used. It will 
> automatically fork any local processes, and will use rsh/ssh to launch 
> any remote processes. Again, I don't think we ever imagined the use case 
> on a UNIX-like system where there are no launchers like SLURM 
> available, and rsh/ssh also wasn't available (Open MPI is, after all, 
> primarily concerned with multi-node operation).
> 
> So, there are several ways around this:
> 
> 1. Make rsh or ssh available, even though they will not be used.
> 
> 2. Tell the 'rsh' pls component to use a dummy program such as 
> /bin/false by adding the following to the command line:
> -mca pls_rsh_agent /bin/false
> 
> 3. Create a dummy 'rsh' executable that is available in your path.
> 
> For instance:
> 
> [tprins@odin ~]$ which ssh
> /usr/bin/which: no ssh in 
> (/u/tprins/usr/ompia/bin:/u/tprins/usr/bin:/usr/local/bin:/bin:/usr/X11R6/bin)
> [tprins@odin ~]$ which rsh
> /usr/bin/which: no rsh in 
> (/u/tprins/usr/ompia/bin:/u/tprins/usr/bin:/usr/local/bin:/bin:/usr/X11R6/bin)
> [tprins@odin ~]$ mpirun -np 1  hostname
> [odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file 
> runtime/orte_init_stage1.c at line 317
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>orte_pls_base_select failed
>--> Returned value Error (-1) instead of ORTE_SUCCESS
> 
> --
> [odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file 
> runtime/orte_system_init.c at line 46
> [odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file 
> runtime/orte_init.c at line 52
> [odin.cs.indiana.edu:18913] [0,0,0] ORTE_ERROR_LOG: Error in file 
> orterun.c at line 399
> 
> [tprins@odin ~]$ mpirun -np 1 -mca pls_rsh_agent /bin/false  hostname
> odin.cs.indiana.edu
> 
> [tprins@odin ~]$ touch usr/bin/rsh
> [tprins@odin ~]$ chmod +x usr/bin/rsh
> [tprins@odin ~]$ mpirun -np 1  hostname
> odin.cs.indiana.edu
> [tprins@odin ~]$
> 
> 
> I hope this helps,
> 
> Tim

Yes, this helps tremendously.  I installed rsh, and now it pretty much
works.

The one missing detail is that I can't seem to get the stdout/stderr
output.  For example:

$ orterun -np 1 uptime
$ uptime
18:24:27 up 13 days,  3:03,  0 users,  load average: 0.00, 0.03, 0.00

The man page indicates that stdout/stderr is supposed to come back to
the stdout/stderr of the orterun process.  Any ideas on why this isn't
working?

Thank you again!

-Adam
-- 
GPG fingerprint: D54D 1AEE B11C CE9B A02B  C5DD 526F 01E8 564E E4B6

Welcome to the best software in the world today cafe!
http://www.take6.com/albums/greatesthits.html



Re: [OMPI users] mpirun hanging followup

2007-07-18 Thread Ralph H Castain
Hooray! Glad we could help track this down - sorry it was so hard to do so.

To answer your questions:

1. Yes - ORTE should bail out gracefully. It definitely should not hang. I
will log the problem and investigate. I believe I know where the problem
lies, and it may already be fixed on our trunk, but the fix may not get into
the 1.2 family (have to see what it would entail).

2. I will definitely commit that debug code and ensure it is in future
releases.

3. I'll see if we can add something about --debug-daemons to the FAQ -
thanks for pointing out that oversight.

Thanks
Ralph



On 7/18/07 12:19 PM, "Bill Johnstone"  wrote:

> 
> --- Ralph Castain  wrote:
> 
>> Unfortunately, we don't have more debug statements internal to that
>> function. I'll have to create a patch for you that will add some so
>> we can
>> better understand why it is failing - will try to send it to you on
>> Wed.
> 
> Thank you for the patch you sent.
> 
> I solved the problem.  It was a head-slapper of an error.  Turned out
> that I had forgotten -- the permissions on the filesystem override the
> permissions of the mount point.  As I mentioned, these machines have an
> NFS root filesystem.  In that filesystem, tmp has permissions 1777.
> However, when each node mounts its local temp partition to /tmp, the
> permissions on that filesystem are the permissions the mount point
> takes on.
> 
> In this case, I had forgotten to apply permissions 1777 to /tmp after
> mounting on each machine.  As a result, /tmp really did not have the
> appropriate permissions for mpirun to write to it as necessary.
> 
> Your patch helped me figure this out.  Technically, I should have been
> able to figure it out from the messages you'd already sent to the
> mailing list, but it wasn't until I saw the line in session_dir.c where
> the error was occurring that I realized it had to be some kind of
> permissions error.
> 
> I've attached the new debug output below:
> 
> [node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
> util/session_dir.c at line 108
> [node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
> util/session_dir.c at line 391
> [node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
> runtime/orte_init_stage1.c at line 626
> --
> It looks like orte_init failed for some reason; your parallel process
> is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   orte_session_dir failed
>   --> Returned value -1 instead of ORTE_SUCCESS
> 
> --
> [node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
> runtime/orte_system_init.c at line 42
> [node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
> runtime/orte_init.c at line 52
> Open RTE was unable to initialize properly.  The error occured while
> attempting to orte_init().  Returned value -1 instead of ORTE_SUCCESS.
> 
> Starting at line 108 of session_dir.c, is:
> 
> if (ORTE_SUCCESS != (ret = opal_os_dirpath_create(directory, my_mode)))
> {
> ORTE_ERROR_LOG(ret);
> }
> 
> Three further points:
> 
> -Is there some reason ORTE can't bail out gracefully upon this error,
> instead of hanging like it was doing for me?
> 
> -I think leaving in the extra debug logging code you sent me in the
> patch for future Open MPI versions would be a good idea to help
> troubleshoot problems like this.
> 
> -It would be nice to see "--debug-daemons" added to the Troubleshooting
> section of the FAQ on the web site.
> 
> Thank you very very much for your help Ralph and everyone else that replied.
> 
> 




Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Tim Prins

> Yes, this helps tremendously.  I installed rsh, and now it pretty much
> works.
Glad this worked out for you.

>
> The one missing detail is that I can't seem to get the stdout/stderr
> output.  For example:
>
> $ orterun -np 1 uptime
> $ uptime
> 18:24:27 up 13 days,  3:03,  0 users,  load average: 0.00, 0.03, 0.00
>
> The man page indicates that stdout/stderr is supposed to come back to
> the stdout/stderr of the orterun process.  Any ideas on why this isn't
> working?
It should work. However, we currently have some I/O forwarding problems which 
show up in some environments that will (hopefully) be fixed in the next 
release. As far as I know, the problem seems to happen mostly with non-mpi 
applications.

Try running a simple mpi application, such as:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
int rank, size;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
printf("Hello, world, I am %d of %d\n", rank, size);
MPI_Finalize();

return 0;
}

If that works fine, then it is probably our problem, and not a problem with 
your setup.

Sorry I don't have a better answer :(

Tim






Re: [OMPI users] Problems running openmpi under os x

2007-07-18 Thread Tim Cornwell


Brian,

To close this one off, we found that one of our libraries has a  
malloc/free that was being called from ompi. I should have looked at  
the crash reporter. It reported


Exception:  EXC_BAD_ACCESS (0x0001)
Codes:  KERN_INVALID_ADDRESS (0x0001) at 0x05801bfc

Thread 0 Crashed:
0   libcasa_casa.dylib  0x0107b319 free + 51
1   libopen-pal.0.dylib 	0x0289eff9 opal_install_dirs_expand + 467  
(installdirs_base_expand.c:68)
2   libopen-pal.0.dylib 	0x0289e5a0 opal_installdirs_base_open + 1115  
(installdirs_base_components.c:96)
3   libopen-pal.0.dylib 	0x0287ba40 opal_init_util + 217 (opal_init.c: 
150)

4   libopen-pal.0.dylib 0x0287bb24 opal_init + 24 (opal_init.c:200)
5   libmpi.0.dylib  	0x01d745cd ompi_mpi_init + 33  
(ompi_mpi_init.c:219)

6   libmpi.0.dylib  0x01db48db MPI_Init + 293 (init.c:71)
7   ctest   0x2f90 main + 24 (ctest.cc:4)
8   ctest   0x2906 _start + 216
9   ctest   0x282d start + 41

On looking into this more, we found that the Lea Malloc was used in  
the casa_casa library. Removing it cured the problem.


Thanks for the help,

Tim

On 12/07/2007, at 2:54 PM, Tim Cornwell wrote:



Brian,

I think it's just a symbol clash. A test program linked with just  
mpicxx works fine but with our typical link, it fails. I've  
narrowed the problem down to a single shared library. This is from C++,  
and the symbols have a namespace casa. Weeding out all the  
casa stuff and some other cruft, we're left with:


0009df14 T QuantaProxy::fits()
0011277c S int __gnu_cxx::__capture_isnan(double)
0014b4ae S std::invalid_argument::~invalid_argument()
0014b48e S std::invalid_argument::~invalid_argument()
00112790 S int std::isnan(double)
001200e8 S void** std::fill_n(void**,  
unsigned int, void* const&)
0012da12 S std::complex* std::fill_n*,  
unsigned int, std::complex >(std::complex*,  
unsigned int, std::complex const&)
0012d9ae S std::complex* std::fill_n*,  
unsigned int, std::complex >(std::complex*, unsigned  
int, std::complex const&)
00104a4c S bool* std::fill_n(bool*,  
unsigned int, bool const&)
0010b126 S double* std::fill_n 
(double*, unsigned int, double const&)
0012043a S float* std::fill_n(float*,  
unsigned int, float const&)
00120386 S int* std::fill_n(int*, unsigned  
int, int const&)
001203e0 S unsigned int* std::fill_nunsigned int>(unsigned int*, unsigned int, unsigned int const&)
00120322 S short* std::fill_n(short*,  
unsigned int, short const&)
0012d94a S unsigned short* std::fill_nint, unsigned short>(unsigned short*, unsigned int, unsigned short  
const&)
00112bf6 S void std::__reverse<__gnu_cxx::__normal_iteratorstd::basic_string,  
std::allocator > > >(__gnu_cxx::__normal_iteratorstd::basic_string,  
std::allocator > >, __gnu_cxx::__normal_iteratorstd::basic_string,  
std::allocator > >, std::random_access_iterator_tag)
00112bbc S __gnu_cxx::__normal_iteratorstd::basic_string,  
std::allocator > >  
std::transform<__gnu_cxx::__normal_iteratorstd::basic_string,  
std::allocator > >, __gnu_cxx::__normal_iteratorstd::basic_string,  
std::allocator > >, int (*)(int)> 
(__gnu_cxx::__normal_iteratorstd::char_traits, std::allocator > >,  
__gnu_cxx::__normal_iteratorstd::char_traits, std::allocator > >,  
__gnu_cxx::__normal_iteratorstd::char_traits, std::allocator > >, int (*)(int))

00198740 S typeinfo for std::invalid_argument
00192cac S typeinfo name for std::invalid_argument
001993e0 S vtable for std::invalid_argument


We're all using the standard of OS X:

$ mpicxx -v
Using built-in specs.
Target: i686-apple-darwin8
Configured with: /private/var/tmp/gcc/gcc-5367.obj~1/src/configure
--disable-checking -enable-werror --prefix=/usr --mandir=/share/man
--enable-languages=c,objc,c++,obj-c++
--program-transform-name=/^[cg][^.-]*$/s/$/-4.0/
--with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib
--build=powerpc-apple-darwin8 --with-arch=nocona --with-tune=generic
--program-prefix= --host=i686-apple-darwin8 --target=i686-apple-darwin8

Thread model: posix
gcc version 4.0.1 (Apple Computer, Inc. build 5367)

Tim



On 12/07/2007, at 7:57 AM, Brian Barrett wrote:


That's unexpected.  If you run the command 'ompi_info --all', it
should list (towards the top) things like the Bindir and Libdir.  Can
you see if those have sane values?  If they do, can you try running a
simple hello, world type MPI application (there's one in the OMPI
tarball).  It almost looks like memory is getting corrupted, which
would be very unexpected that early in the process.  I'm unable to
duplicate the problem with 1.2.3 on my Mac Pro, making it all the
more strange.

Another random thought -- Which compilers did you use to build  
Open MPI?


Brian


On Jul 11, 2007, at 1:27 PM, Tim Cornwell wrote:



 Open MPI: 1.2.3
Open MPI SVN revision: r15136
 Open RTE: 1.2.3
Open RTE SVN revision: r15136
 OPAL: 1.2.3
   

Re: [OMPI users] orte_pls_base_select fails

2007-07-18 Thread Dirk Eddelbuettel

Hi Tim,

Thanks for the follow-up

On 18 July 2007 at 17:22, Tim Prins wrote:
| 
| > Yes, this helps tremendously.  I installed rsh, and now it pretty much
| > works.
| Glad this worked out for you.
| 
| >
| > The one missing detail is that I can't seem to get the stdout/stderr
| > output.  For example:
| >
| > $ orterun -np 1 uptime
| > $ uptime
| > 18:24:27 up 13 days,  3:03,  0 users,  load average: 0.00, 0.03, 0.00
| >
| > The man page indicates that stdout/stderr is supposed to come back to
| > the stdout/stderr of the orterun process.  Any ideas on why this isn't
| > working?
| It should work. However, we currently have some I/O forwarding problems which 
| show up in some environments that will (hopefully) be fixed in the next 
| release. As far as I know, the problem seems to happen mostly with non-mpi 
| applications.
| 
| Try running a simple mpi application, such as:
| 
| #include <stdio.h>
| #include "mpi.h"
| 
| int main(int argc, char* argv[])
| {
| int rank, size;
| 
| MPI_Init(&argc, &argv);
| MPI_Comm_rank(MPI_COMM_WORLD, &rank);
| MPI_Comm_size(MPI_COMM_WORLD, &size);
| printf("Hello, world, I am %d of %d\n", rank, size);
| MPI_Finalize();
| 
| return 0;
| }
| 
| If that works fine, then it is probably our problem, and not a problem with 
| your setup.
| 
| Sorry I don't have a better answer :(

That works (and I use the same Debian openmpi 1.2.3-1 set of packages Adam
has): 

edd@basebud:~> opalcc -o /tmp/openmpitest /tmp/openmpitest.c -lmpi
edd@basebud:~> orterun -np 4 /tmp/openmpitest
Hello, world, I am 2 of 4
Hello, world, I am 1 of 4
Hello, world, I am 0 of 4
Hello, world, I am 3 of 4
edd@basebud:~>

I was toying with this at work earlier, and it was hanging there (using
hostname or uptime as the token binaries) as soon as I increased the np
parameter beyond 1. 

It works here:

edd@basebud:~> orterun -np 4 hostname
basebud
basebud
basebud
basebud
edd@basebud:~>

I have slurm-llnl test packages installed at work but not here. Maybe I need
to dig a bit more into slurm.  (Adam: slurm package should be forthcoming.
I can point you to the snapshots from the fellow whom I mentor on this.)

Dirk

-- 
Hell, there are no rules here - we're trying to accomplish something. 
  -- Thomas A. Edison