OK, I've added the debug flags. When I add them to the
os.system instance of orterun, there is no additional output,
but when I add them to the orterun instance controlling the
Python program, I get the following:

orterun -np 1 --debug-daemons -mca odls_base_verbose 1 python ./test.py
Daemon [0,0,1] checking in as pid 18054 on host druid.wustl.edu
[druid.wustl.edu:18054] [0,0,1] orted: received launch callback
[druid.wustl.edu:18054] odls: setting up launch for job 1
[druid.wustl.edu:18054] odls: overriding oversubscription
[druid.wustl.edu:18054] odls: oversubscribed set to false want_processor set to true
[druid.wustl.edu:18054] odls: preparing to launch child [0, 1, 0]
Pypar (version 1.9.3) initialised MPI OK with 1 processors
[druid.wustl.edu:18057] OOB: Connection to HNP lost
[druid.wustl.edu:18054] odls: child process terminated
[druid.wustl.edu:18054] odls: child process [0,1,0] terminated normally
[druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received message from [0,0,0]
[druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received exit
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: working on job -1
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: checking child process [0,1,0]
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: child is not alive

(the Pypar output is from loading that module; the next thing in
the code is the os.system call to start orterun with 2 processors.)

Also, there is absolutely no output from the program launched by
the second orterun (even its first line does not execute).

Cheers,

Lev



Date: Wed, 11 Jul 2007 13:26:22 -0600
From: Ralph H Castain <r...@lanl.gov>
Subject: Re: [OMPI users] Recursive use of "orterun"
To: Open MPI Users <us...@open-mpi.org>

I'm unaware of any issues that would cause it to fail just because it is
being run via that interface.

The error message is telling us that the procs got launched, but then
orterun went away unexpectedly. Are you seeing your procs complete? We do
sometimes see that message due to a race condition between the daemons
spawned to support the application procs and orterun itself (see other
recent notes in this forum).

If your procs are not completing, then it would mean that either the
connecting fabric is failing for some reason, or orterun is terminating
early. If you could add --debug-daemons -mca odls_base_verbose 1 to the
os.system command, the output from that might help us understand why it is
failing.
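
For example, using the command string from your earlier message, the
call would look something like this (just the flags added, nothing
else changed):

    os.system('orterun --debug-daemons -mca odls_base_verbose 1 -np 2 '
              'nwchem.x nwchem.inp > nwchem.out')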

Ralph



On 7/11/07 10:49 AM, "Lev Gelb" <g...@wuchem.wustl.edu> wrote:


Hi -

I'm trying to port an application to use Open MPI, and I'm running
into a problem.  The program (written in Python, parallelized
using either "pypar" or "pyMPI") itself invokes "mpirun"
in order to manage external parallel processes, via something like:

    orterun -np 2 python myapp.py

where myapp.py contains:

    os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')
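
Schematically, the whole script looks something like this (a minimal
sketch, not the real code; in particular I'm assuming here that only
one rank issues the nested launch):

    # myapp.py -- minimal sketch of the pattern, not the actual application
    import os
    import pypar                 # initialises MPI on import

    if pypar.rank() == 0:        # assumption: a single rank starts the external job
        os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')

    pypar.barrier()              # assumed: all ranks wait before finalising
    pypar.finalize()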

I have this working under both LAM-MPI and MPICH on a variety
of different machines.  However, with Open MPI, all I get is an
immediate return from the system call and the error:

"OOB: Connection to HNP lost"

I have verified that the command passed to os.system is correct,
and even that it runs correctly if "myapp.py" doesn't invoke any
MPI calls of its own.

I'm testing Open MPI on a single box, so there's no machinefile
currently in use.  The system is running Fedora Core 6 x86-64; I'm
using the latest openmpi-1.2.3-1.src.rpm, rebuilt on the machine in
question.  I can provide additional configuration details if necessary.

Thanks, in advance, for any help or advice,

Lev


------------------------------------------------------------------
Lev Gelb
Associate Professor
Department of Chemistry, Washington University in St. Louis
St. Louis, MO 63130  USA

email: g...@wustl.edu
phone: (314)935-5026 fax:   (314)935-4481

http://www.chemistry.wustl.edu/~gelb
------------------------------------------------------------------
