Re: [OMPI users] Recursive use of "orterun"

2007-10-22 Thread Tim Prins
Hi Ides,

Thanks for the report and reminder. I have filed a ticket on this 
(https://svn.open-mpi.org/trac/ompi/ticket/1173) and you should receive email 
as it is updated.

I do not know of any more elegant way to work around this at the moment.

Thanks,

Tim

On Friday 19 October 2007 06:31:53 am idesbald van den bosch wrote:
> Hi,
>
> I've run into the same problem as discussed in the thread Lev Gelb: "Re:
> [OMPI users] Recursive use of "orterun" (Ralph H
> Castain)"<http://www.open-mpi.org/community/lists/users/2007/07/3655.php>
>
> I am running a parallel Python code; from Python I launch a C++ parallel
> program using the os.system command, then return to Python and keep going.
>
> With LAM/MPI there is no problem with this.
>
> But Open MPI systematically crashes, because the Python os.system command
> launches the C++ program with the same OMPI_* environment variables as the
> Python program. As discussed in the thread, I have tried filtering the
> OMPI_* variables prior to launching the C++ program with an os.execve
> command, but then it fails to return control to Python and instead simply
> terminates when the C++ program ends.
>
> There is a workaround (
> http://thread.gmane.org/gmane.comp.clustering.open-mpi.user/986): create a
> *.sh file with the following lines:
>
> 
> for i in $(env | grep OMPI_MCA |sed 's/=/ /' | awk '{print $1}')
> do
>    unset $i
> done
>
> # now the C++ call
> mpirun -np 2  ./MoM/communicateMeshArrays
> --
>
> and then call the *.sh program through the python os.system command.
>
> What I would like to know is whether this "problem" will get fixed in
> Open MPI. Is there a more elegant way to solve this issue? Meanwhile, I
> will stick to the ugly *.sh hack listed above.
>
> Cheers
>
> Ides




[OMPI users] Recursive use of "orterun"

2007-10-19 Thread idesbald van den bosch
Hi,

I've run into the same problem as discussed in the thread Lev Gelb: "Re:
[OMPI users] Recursive use of "orterun" (Ralph H
Castain)"<http://www.open-mpi.org/community/lists/users/2007/07/3655.php>

I am running a parallel Python code; from Python I launch a C++ parallel
program using the os.system command, then return to Python and keep going.

With LAM/MPI there is no problem with this.

But Open MPI systematically crashes, because the Python os.system command
launches the C++ program with the same OMPI_* environment variables as the
Python program. As discussed in the thread, I have tried filtering the
OMPI_* variables prior to launching the C++ program with an os.execve
command, but then it fails to return control to Python and instead simply
terminates when the C++ program ends.

There is a workaround (
http://thread.gmane.org/gmane.comp.clustering.open-mpi.user/986): create a
*.sh file with the following lines:


for i in $(env | grep OMPI_MCA |sed 's/=/ /' | awk '{print $1}')
do
   unset $i
done

# now the C++ call
mpirun -np 2  ./MoM/communicateMeshArrays
--

and then call the *.sh program through the python os.system command.
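
(A rough, untested sketch of the same workaround done directly from Python, for anyone who would rather skip the extra *.sh file: copy the environment without the OMPI_* keys and start the nested mpirun through the subprocess module, which, unlike os.execve, returns control to the caller when the child exits. The subprocess usage is an illustration only and was not part of the original report.)

import os
import subprocess

# Copy the environment, dropping the OMPI_* variables that confuse the nested launch.
child_env = dict((k, v) for k, v in os.environ.items() if not k.startswith('OMPI_'))

# Unlike os.execve, this returns control to the calling script when the child exits.
subprocess.call(['mpirun', '-np', '2', './MoM/communicateMeshArrays'], env=child_env)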

What I would like to know is whether this "problem" will get fixed in
Open MPI. Is there a more elegant way to solve this issue? Meanwhile, I
will stick to the ugly *.sh hack listed above.

Cheers

Ides


Re: [OMPI users] Recursive use of "orterun" (Ralph H Castain)

2007-07-11 Thread Lev Gelb
(the Pypar output is from loading that module; the next thing in
the code is the os.system call to start orterun with 2 processors.)

Also, there is absolutely no output from the second orterun-launched
program (even the first line does not execute.)

Cheers,

Lev




Message: 5
Date: Wed, 11 Jul 2007 13:26:22 -0600
From: Ralph H Castain <r...@lanl.gov>
Subject: Re: [OMPI users] Recursive use of "orterun"
To: "Open MPI Users <us...@open-mpi.org>" <us...@open-mpi.org>
Message-ID: <c2ba8afe.9e64%...@lanl.gov>
Content-Type: text/plain; charset="US-ASCII"

I'm unaware of any issues that would cause it to fail just because it is
being run via that interface.

The error message is telling us that the procs got launched, but then
orterun went away unexpectedly. Are you seeing your procs complete? We do
sometimes see that message due to a race condition between the daemons
spawned to support the application procs and orterun itself (see other
recent notes in this forum).

If your procs are not completing, then it would mean that either the
connecting fabric is failing for some reason, or orterun is terminating
early. If you could add --debug-daemons -mca odls_base_verbose 1 to the
os.system command, the output from that might help us understand why it is
failing.

Ralph



On 7/11/07 10:49 AM, "Lev Gelb" <g...@wuchem.wustl.edu> wrote:



Hi -

I'm trying to port an application to use OpenMPI, and running
into a problem.  The program (written in Python, parallelized
using either of "pypar" or "pyMPI") itself invokes "mpirun"
in order to manage external, parallel processes, via something like:

orterun -np 2 python myapp.py

where myapp.py contains:

os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')

I have this working under both LAM-MPI and MPICH on a variety
of different machines.  However, with OpenMPI,  all I get is an
immediate return from the system call and the error:

"OOB: Connection to HNP lost"

I have verified that the command passed to os.system is correct,
and even that it runs correctly if "myapp.py" doesn't invoke any
MPI calls of its own.

I'm testing openMPI on a single box, so there's no machinefile-stuff
currently
active.  The system is running Fedora Core 6 x86-64, I'm using the latest
openmpi-1.2.3-1.src.rpm rebuilt on the machine in question,
I can provide additional configuration details if necessary.

Thanks, in advance, for any help or advice,

Lev


--
Lev Gelb Associate Professor Department of Chemistry, Washington
University
in
St. Louis, St. Louis, MO 63130  USA

email: g...@wustl.edu
phone: (314)935-5026 fax:   (314)935-4481

http://www.chemistry.wustl.edu/~gelb
--




--
Lev Gelb
Associate Professor
Department of Chemistry,
Washington University in St. Louis,
St. Louis, MO 63130  USA

email: g...@wustl.edu
phone: (314)935-5026
fax:   (314)935-4481

http://www.chemistry.wustl.edu/~gelb
--








Re: [OMPI users] Recursive use of "orterun" (Ralph H Castain)

2007-07-11 Thread Ralph Castain
>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: working on job -1
>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: checking child
>>> process [0,1,0]
>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: child is not alive
>>> 
>>> (the Pypar output is from loading that module; the next thing in
>>> the code is the os.system call to start orterun with 2 processors.)
>>> 
>>> Also, there is absolutely no output from the second orterun-launched
>>> program (even the first line does not execute.)
>>> 
>>> Cheers,
>>> 
>>> Lev
>>> 
>>> 
>>> 
>>>> Message: 5
>>>> Date: Wed, 11 Jul 2007 13:26:22 -0600
>>>> From: Ralph H Castain <r...@lanl.gov>
>>>> Subject: Re: [OMPI users] Recursive use of "orterun"
>>>> To: "Open MPI Users <us...@open-mpi.org>" <us...@open-mpi.org>
>>>> Message-ID: <c2ba8afe.9e64%...@lanl.gov>
>>>> Content-Type: text/plain; charset="US-ASCII"
>>>> 
>>>> I'm unaware of any issues that would cause it to fail just because it is
>>>> being run via that interface.
>>>> 
>>>> The error message is telling us that the procs got launched, but then
>>>> orterun went away unexpectedly. Are you seeing your procs complete? We do
>>>> sometimes see that message due to a race condition between the daemons
>>>> spawned to support the application procs and orterun itself (see other
>>>> recent notes in this forum).
>>>> 
>>>> If your procs are not completing, then it would mean that either the
>>>> connecting fabric is failing for some reason, or orterun is terminating
>>>> early. If you could add --debug-daemons -mca odls_base_verbose 1 to the
>>>> os.system command, the output from that might help us understand why it is
>>>> failing.
>>>> 
>>>> Ralph
>>>> 
>>>> 
>>>> 
>>>> On 7/11/07 10:49 AM, "Lev Gelb" <g...@wuchem.wustl.edu> wrote:
>>>> 
>>>>> 
>>>>> Hi -
>>>>> 
>>>>> I'm trying to port an application to use OpenMPI, and running
>>>>> into a problem.  The program (written in Python, parallelized
>>>>> using either of "pypar" or "pyMPI") itself invokes "mpirun"
>>>>> in order to manage external, parallel processes, via something like:
>>>>> 
>>>>> orterun -np 2 python myapp.py
>>>>> 
>>>>> where myapp.py contains:
>>>>> 
>>>>> os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')
>>>>> 
>>>>> I have this working under both LAM-MPI and MPICH on a variety
>>>>> of different machines.  However, with OpenMPI,  all I get is an
>>>>> immediate return from the system call and the error:
>>>>> 
>>>>> "OOB: Connection to HNP lost"
>>>>> 
>>>>> I have verified that the command passed to os.system is correct,
>>>>> and even that it runs correctly if "myapp.py" doesn't invoke any
>>>>> MPI calls of its own.
>>>>> 
>>>>> I'm testing openMPI on a single box, so there's no machinefile-stuff
>>>>> currently
>>>>> active.  The system is running Fedora Core 6 x86-64, I'm using the latest
>>>>> openmpi-1.2.3-1.src.rpm rebuilt on the machine in question,
>>>>> I can provide additional configuration details if necessary.
>>>>> 
>>>>> Thanks, in advance, for any help or advice,
>>>>> 
>>>>> Lev
>>>>> 
>>>>> 
>>>>> --
>>>>> Lev Gelb Associate Professor Department of Chemistry, Washington
>>>>> University
>>>>> in
>>>>> St. Louis, St. Louis, MO 63130  USA
>>>>> 
>>>>> email: g...@wustl.edu
>>>>> phone: (314)935-5026 fax:   (314)935-4481
>>>>> 
>>>>> http://www.chemistry.wustl.edu/~gelb
>>>>> --
>>>>> 
>> 
> 
> --
> Lev Gelb 
> Associate Professor
> Department of Chemistry,
> Washington University in St. Louis,
> St. Louis, MO 63130  USA
> 
> email: g...@wustl.edu
> phone: (314)935-5026
> fax:   (314)935-4481
> 
> http://www.chemistry.wustl.edu/~gelb
> --
> 




Re: [OMPI users] Recursive use of "orterun" (Ralph H Castain)

2007-07-11 Thread Lev Gelb


Thanks for the suggestions.  The separate 'orted' scheme (below) did
not work, unfortunately;  same behavior as before.  I have conducted
a few other simple tests, and found:

1.  The problem only occurs if the first process is "in" MPI;
if it doesn't call MPI_Init or calls MPI_Finalize before it executes
the second orterun, everything works.

2.  Whether or not the second process actually uses MPI doesn't matter.

3.  Using the standalone orted in "debug" mode with "universe"
specified throughout, there does not appear to be any communication to 
orted upon the second invocation of orterun.


(Also, I've been able to get working nested orteruns using simple shell 
scripts, but these don't call MPI_Init.)
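
(As a concrete illustration of point 1, a hypothetical test.py along these lines corresponds to the working case: MPI is finalized in the controlling script before the nested orterun is launched. The pypar.finalize() call is assumed to wrap MPI_Finalize; none of this code comes from the thread itself.)

import os
import pypar  # importing pypar initialises MPI in this process

# ... parallel Python work would go here ...

# Point 1 above: finalize MPI in the controlling script *before* the nested launch.
pypar.finalize()  # assumed to wrap MPI_Finalize

# With MPI already finalized in this process, the nested orterun is reported to work.
os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')

(Of course, no further MPI communication is possible after the finalize, so this only helps when the nested launch is the last parallel step in the controlling script.)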


Cheers,

Lev



On Wed, 11 Jul 2007, Ralph H Castain wrote:


Hmmm...well, what that indicates is that your application program is losing
the connection to orterun, but that orterun is still alive and kicking (it
is alive enough to send the [0,0,1] daemon a message ordering it to exit).
So the question is: why is your application program dropping the connection?

I haven't tried doing embedded orterun commands, so there could be a
conflict there that causes the OOB connection to drop. Best guess is that
there is confusion over which orterun it is supposed to connect to. I can
give it a try and see - this may not be a mode we can support.

Alternatively, you could start a persistent daemon and then just allow both
orterun instances to report to it. Our method for doing that isn't as
convenient as we want it to be (we hope to improve that soon), but it does work.
What you have to do is:

1. to start the persistent daemon, type:

"orted --seed --persistent --scope public --universe foo"

where foo can be whatever name you like.

2. when you execute your application, use:

orterun -np 1 --universe foo python ./test.py

where the "foo" matches the name given above.

3. in your os.system command, you'll need that same "--universe foo" option

That may solve the problem (let me know if it does). Meantime, I'll take a
look at the embedded approach without the persistent daemon...may take me
awhile as I'm in the middle of something, but I will try to get to it
shortly.

Ralph


On 7/11/07 1:40 PM, "Lev Gelb" <g...@wuchem.wustl.edu> wrote:



OK, I've added the debug flags - when I add them to the
os.system instance of orterun, there is no additional input,
but when I add them to the orterun instance controlling the
python program, I get the following:


orterun -np 1 --debug-daemons -mca odls_base_verbose 1 python ./test.py

Daemon [0,0,1] checking in as pid 18054 on host druid.wustl.edu
[druid.wustl.edu:18054] [0,0,1] orted: received launch callback
[druid.wustl.edu:18054] odls: setting up launch for job 1
[druid.wustl.edu:18054] odls: overriding oversubscription
[druid.wustl.edu:18054] odls: oversubscribed set to false want_processor
set to true
[druid.wustl.edu:18054] odls: preparing to launch child [0, 1, 0]
Pypar (version 1.9.3) initialised MPI OK with 1 processors
[druid.wustl.edu:18057] OOB: Connection to HNP lost
[druid.wustl.edu:18054] odls: child process terminated
[druid.wustl.edu:18054] odls: child process [0,1,0] terminated normally
[druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received message from
[0,0,0]
[druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received exit
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: working on job -1
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: checking child
process [0,1,0]
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: child is not alive

(the Pypar output is from loading that module; the next thing in
the code is the os.system call to start orterun with 2 processors.)

Also, there is absolutely no output from the second orterun-launched
program (even the first line does not execute.)

Cheers,

Lev




Message: 5
Date: Wed, 11 Jul 2007 13:26:22 -0600
From: Ralph H Castain <r...@lanl.gov>
Subject: Re: [OMPI users] Recursive use of "orterun"
To: "Open MPI Users <us...@open-mpi.org>" <us...@open-mpi.org>
Message-ID: <c2ba8afe.9e64%...@lanl.gov>
Content-Type: text/plain; charset="US-ASCII"

I'm unaware of any issues that would cause it to fail just because it is
being run via that interface.

The error message is telling us that the procs got launched, but then
orterun went away unexpectedly. Are you seeing your procs complete? We do
sometimes see that message due to a race condition between the daemons
spawned to support the application procs and orterun itself (see other
recent notes in this forum).

If your procs are not completing, then it would mean that either the
connecting fabric is failing for some reason, or orterun is terminating
early. If you could add --debug-daemons -mca odls_base_verbose 1 to the
os.system command, the output from that might help us understand why it is
failing.

Ralph



On 

Re: [OMPI users] Recursive use of "orterun" (Ralph H Castain)

2007-07-11 Thread Ralph H Castain
Hmmm...well, what that indicates is that your application program is losing
the connection to orterun, but that orterun is still alive and kicking (it
is alive enough to send the [0,0,1] daemon a message ordering it to exit).
So the question is: why is your application program dropping the connection?

I haven't tried doing embedded orterun commands, so there could be a
conflict there that causes the OOB connection to drop. Best guess is that
there is confusion over which orterun it is supposed to connect to. I can
give it a try and see - this may not be a mode we can support.

Alternatively, you could start a persistent daemon and then just allow both
orterun instances to report to it. Our method for doing that isn't as
convenient as we want it to be (we hope to improve that soon), but it does work.
What you have to do is:

1. to start the persistent daemon, type:

"orted --seed --persistent --scope public --universe foo"

where foo can be whatever name you like.

2. when you execute your application, use:

orterun -np 1 --universe foo python ./test.py

where the "foo" matches the name given above.

3. in your os.system command, you'll need that same "--universe foo" option
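
(Putting the three steps together with the nwchem example from earlier in the thread, the nested call inside the Python script would then look roughly like this; untested, and "foo" is just the placeholder universe name from step 1.)

import os

# Step 3: the nested orterun reports to the same persistent universe "foo"
# that the outer orterun from step 2 was started with.
os.system('orterun -np 2 --universe foo nwchem.x nwchem.inp > nwchem.out')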

That may solve the problem (let me know if it does). Meantime, I'll take a
look at the embedded approach without the persistent daemon...may take me
awhile as I'm in the middle of something, but I will try to get to it
shortly.

Ralph


On 7/11/07 1:40 PM, "Lev Gelb" <g...@wuchem.wustl.edu> wrote:

> 
> OK, I've added the debug flags - when I add them to the
> os.system instance of orterun, there is no additional input,
> but when I add them to the orterun instance controlling the
> python program, I get the following:
> 
>> orterun -np 1 --debug-daemons -mca odls_base_verbose 1 python ./test.py
> Daemon [0,0,1] checking in as pid 18054 on host druid.wustl.edu
> [druid.wustl.edu:18054] [0,0,1] orted: received launch callback
> [druid.wustl.edu:18054] odls: setting up launch for job 1
> [druid.wustl.edu:18054] odls: overriding oversubscription
> [druid.wustl.edu:18054] odls: oversubscribed set to false want_processor
> set to true
> [druid.wustl.edu:18054] odls: preparing to launch child [0, 1, 0]
> Pypar (version 1.9.3) initialised MPI OK with 1 processors
> [druid.wustl.edu:18057] OOB: Connection to HNP lost
> [druid.wustl.edu:18054] odls: child process terminated
> [druid.wustl.edu:18054] odls: child process [0,1,0] terminated normally
> [druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received message from
> [0,0,0]
> [druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received exit
> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: working on job -1
> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: checking child
> process [0,1,0]
> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: child is not alive
> 
> (the Pypar output is from loading that module; the next thing in
> the code is the os.system call to start orterun with 2 processors.)
> 
> Also, there is absolutely no output from the second orterun-launched
> program (even the first line does not execute.)
> 
> Cheers,
> 
> Lev
> 
> 
> 
>> Message: 5
>> Date: Wed, 11 Jul 2007 13:26:22 -0600
>> From: Ralph H Castain <r...@lanl.gov>
>> Subject: Re: [OMPI users] Recursive use of "orterun"
>> To: "Open MPI Users <us...@open-mpi.org>" <us...@open-mpi.org>
>> Message-ID: <c2ba8afe.9e64%...@lanl.gov>
>> Content-Type: text/plain; charset="US-ASCII"
>> 
>> I'm unaware of any issues that would cause it to fail just because it is
>> being run via that interface.
>> 
>> The error message is telling us that the procs got launched, but then
>> orterun went away unexpectedly. Are you seeing your procs complete? We do
>> sometimes see that message due to a race condition between the daemons
>> spawned to support the application procs and orterun itself (see other
>> recent notes in this forum).
>> 
>> If your procs are not completing, then it would mean that either the
>> connecting fabric is failing for some reason, or orterun is terminating
>> early. If you could add --debug-daemons -mca odls_base_verbose 1 to the
>> os.system command, the output from that might help us understand why it is
>> failing.
>> 
>> Ralph
>> 
>> 
>> 
>> On 7/11/07 10:49 AM, "Lev Gelb" <g...@wuchem.wustl.edu> wrote:
>> 
>>> 
>>> Hi -
>>> 
>>> I'm trying to port an application to use OpenMPI, and running
>>> into a problem.  The program (written in Python, parallelized
>>> using either of "pypar" or "pyMPI") itself invokes "mpir

Re: [OMPI users] Recursive use of "orterun" (Ralph H Castain)

2007-07-11 Thread Lev Gelb


OK, I've added the debug flags - when I add them to the
os.system instance of orterun, there is no additional input,
but when I add them to the orterun instance controlling the
python program, I get the following:


orterun -np 1 --debug-daemons -mca odls_base_verbose 1 python ./test.py

Daemon [0,0,1] checking in as pid 18054 on host druid.wustl.edu
[druid.wustl.edu:18054] [0,0,1] orted: received launch callback
[druid.wustl.edu:18054] odls: setting up launch for job 1
[druid.wustl.edu:18054] odls: overriding oversubscription
[druid.wustl.edu:18054] odls: oversubscribed set to false want_processor 
set to true

[druid.wustl.edu:18054] odls: preparing to launch child [0, 1, 0]
Pypar (version 1.9.3) initialised MPI OK with 1 processors
[druid.wustl.edu:18057] OOB: Connection to HNP lost
[druid.wustl.edu:18054] odls: child process terminated
[druid.wustl.edu:18054] odls: child process [0,1,0] terminated normally
[druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received message from 
[0,0,0]

[druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received exit
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: working on job -1
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: checking child 
process [0,1,0]

[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: child is not alive

(the Pypar output is from loading that module; the next thing in
the code is the os.system call to start orterun with 2 processors.)

Also, there is absolutely no output from the second orterun-launched
program (even the first line does not execute.)

Cheers,

Lev




Message: 5
Date: Wed, 11 Jul 2007 13:26:22 -0600
From: Ralph H Castain <r...@lanl.gov>
Subject: Re: [OMPI users] Recursive use of "orterun"
To: "Open MPI Users <us...@open-mpi.org>" <us...@open-mpi.org>
Message-ID: <c2ba8afe.9e64%...@lanl.gov>
Content-Type: text/plain;   charset="US-ASCII"

I'm unaware of any issues that would cause it to fail just because it is
being run via that interface.

The error message is telling us that the procs got launched, but then
orterun went away unexpectedly. Are you seeing your procs complete? We do
sometimes see that message due to a race condition between the daemons
spawned to support the application procs and orterun itself (see other
recent notes in this forum).

If your procs are not completing, then it would mean that either the
connecting fabric is failing for some reason, or orterun is terminating
early. If you could add --debug-daemons -mca odls_base_verbose 1 to the
os.system command, the output from that might help us understand why it is
failing.

Ralph



On 7/11/07 10:49 AM, "Lev Gelb" <g...@wuchem.wustl.edu> wrote:



Hi -

I'm trying to port an application to use OpenMPI, and running
into a problem.  The program (written in Python, parallelized
using either of "pypar" or "pyMPI") itself invokes "mpirun"
in order to manage external, parallel processes, via something like:

orterun -np 2 python myapp.py

where myapp.py contains:

os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')

I have this working under both LAM-MPI and MPICH on a variety
of different machines.  However, with OpenMPI,  all I get is an
immediate return from the system call and the error:

"OOB: Connection to HNP lost"

I have verified that the command passed to os.system is correct,
and even that it runs correctly if "myapp.py" doesn't invoke any
MPI calls of its own.

I'm testing openMPI on a single box, so there's no machinefile-stuff currently
active.  The system is running Fedora Core 6 x86-64, I'm using the latest
openmpi-1.2.3-1.src.rpm rebuilt on the machine in question,
I can provide additional configuration details if necessary.

Thanks, in advance, for any help or advice,

Lev


--
Lev Gelb Associate Professor Department of Chemistry, Washington University in
St. Louis, St. Louis, MO 63130  USA

email: g...@wustl.edu
phone: (314)935-5026 fax:   (314)935-4481

http://www.chemistry.wustl.edu/~gelb
--



Re: [OMPI users] Recursive use of "orterun"

2007-07-11 Thread Ralph H Castain
I'm unaware of any issues that would cause it to fail just because it is
being run via that interface.

The error message is telling us that the procs got launched, but then
orterun went away unexpectedly. Are you seeing your procs complete? We do
sometimes see that message due to a race condition between the daemons
spawned to support the application procs and orterun itself (see other
recent notes in this forum).

If your procs are not completing, then it would mean that either the
connecting fabric is failing for some reason, or orterun is terminating
early. If you could add --debug-daemons -mca odls_base_verbose 1 to the
os.system command, the output from that might help us understand why it is
failing.
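
(As an illustration only: with the myapp.py example quoted below, the suggested flags would be added to the inner command roughly as follows; the exact command line is an assumption, not something tested in this thread.)

import os

# Inner orterun launched via os.system, with the suggested debugging flags added.
os.system('orterun -np 2 --debug-daemons -mca odls_base_verbose 1 '
          'nwchem.x nwchem.inp > nwchem.out')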

Ralph



On 7/11/07 10:49 AM, "Lev Gelb"  wrote:

> 
> Hi -
> 
> I'm trying to port an application to use OpenMPI, and running
> into a problem.  The program (written in Python, parallelized
> using either of "pypar" or "pyMPI") itself invokes "mpirun"
> in order to manage external, parallel processes, via something like:
> 
> orterun -np 2 python myapp.py
> 
> where myapp.py contains:
> 
> os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')
> 
> I have this working under both LAM-MPI and MPICH on a variety
> of different machines.  However, with OpenMPI,  all I get is an
> immediate return from the system call and the error:
> 
> "OOB: Connection to HNP lost"
> 
> I have verified that the command passed to os.system is correct,
> and even that it runs correctly if "myapp.py" doesn't invoke any
> MPI calls of its own.
> 
> I'm testing openMPI on a single box, so there's no machinefile-stuff currently
> active.  The system is running Fedora Core 6 x86-64, I'm using the latest
> openmpi-1.2.3-1.src.rpm rebuilt on the machine in question,
> I can provide additional configuration details if necessary.
> 
> Thanks, in advance, for any help or advice,
> 
> Lev
> 
> 
> --
> Lev Gelb Associate Professor Department of Chemistry, Washington University in
> St. Louis, St. Louis, MO 63130  USA
> 
> email: g...@wustl.edu
> phone: (314)935-5026 fax:   (314)935-4481
> 
> http://www.chemistry.wustl.edu/~gelb
> --
> 