Hmmm...interesting. As a cross-check: when you call os.system, does your environment get copied across to the child? Reason I ask: we set a number of environment variables when orterun spawns a process. If you call orterun from within that process - and the new orterun sees the environment variables inherited from the parent process - then I can guarantee it won't work.

What you need is for os.system to start its child with a clean environment. I would imagine that if you just os.system'd something that printed the environment, that would suffice to identify the problem. If you see anything that starts with OMPI_MCA_..., then we are indeed doomed - which would also explain why the persistent orted didn't help solve the problem.
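(A minimal sketch of that check from the Python side. The scrubbed-environment launch via subprocess is one possible workaround, not a verified fix, and the nwchem command line is carried over from Lev's original post below.)

    import os
    import subprocess

    # Quick check: does the os.system child inherit Open MPI's variables?
    # Any OMPI_MCA_* (or OMPI_* generally) in the child's environment
    # confirms the inheritance problem described above.
    os.system("env | grep '^OMPI_' || echo 'no OMPI_ variables inherited'")

    # One possible workaround: launch the inner orterun with every OMPI_*
    # variable scrubbed from the environment it inherits.
    clean_env = dict((k, v) for k, v in os.environ.items()
                     if not k.startswith('OMPI_'))
    subprocess.call('orterun -np 2 nwchem.x nwchem.inp > nwchem.out',
                    shell=True, env=clean_env)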
Ralph

On 7/11/07 3:05 PM, "Lev Gelb" <g...@wuchem.wustl.edu> wrote:

> Thanks for the suggestions. The separate 'orted' scheme (below) did
> not work, unfortunately; same behavior as before. I have conducted
> a few other simple tests, and found:
>
> 1. The problem only occurs if the first process is "in" MPI;
> if it doesn't call MPI_Init, or calls MPI_Finalize before it executes
> the second orterun, everything works.
>
> 2. Whether or not the second process actually uses MPI doesn't matter.
>
> 3. Using the standalone orted in "debug" mode with "universe"
> specified throughout, there does not appear to be any communication
> to orted upon the second invocation of orterun.
>
> (Also, I've been able to get nested orteruns working from simple
> shell scripts, but those don't call MPI_Init.)
>
> Cheers,
>
> Lev
>
>
> On Wed, 11 Jul 2007, Ralph H Castain wrote:
>
>> Hmmm...well, what that indicates is that your application program is
>> losing the connection to orterun, but that orterun is still alive and
>> kicking (it is alive enough to send the [0,0,1] daemon a message
>> ordering it to exit). So the question is: why is your application
>> program dropping the connection?
>>
>> I haven't tried doing embedded orterun commands, so there could be a
>> conflict there that causes the OOB connection to drop. My best guess
>> is that there is confusion over which orterun it is supposed to
>> connect to. I can give it a try and see - this may not be a mode we
>> can support.
>>
>> Alternatively, you could start a persistent daemon and then just
>> allow both orterun instances to report to it. Our method for doing
>> that isn't as convenient as we would like (we hope to improve it
>> soon), but it does work. What you have to do is:
>>
>> 1. To start the persistent daemon, type:
>>
>>    orted --seed --persistent --scope public --universe foo
>>
>> where foo can be whatever name you like.
>>
>> 2. When you execute your application, use:
>>
>>    orterun -np 1 --universe foo python ./test.py
>>
>> where the "foo" matches the name given above.
>>
>> 3. In your os.system command, you'll need that same "--universe foo"
>> option (a sketch of the full sequence follows this message).
>>
>> That may solve the problem (let me know if it does). Meantime, I'll
>> take a look at the embedded approach without the persistent
>> daemon...may take me awhile as I'm in the middle of something, but I
>> will try to get to it shortly.
>>
>> Ralph
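(Although Lev reports above that this scheme did not fix his problem, here is a minimal sketch of the sequence Ralph describes, for reference. The universe name "foo" and the nwchem command line are taken from the thread; placing "--universe foo" on the inner orterun follows step 3 but is otherwise unverified.)

    import os

    # Step 1 (run once, in a separate shell): start the persistent
    # daemon that both orterun instances will report to.
    #
    #   orted --seed --persistent --scope public --universe foo
    #
    # Step 2 (how this script itself is launched):
    #
    #   orterun -np 1 --universe foo python ./test.py

    # Step 3: the nested orterun names the same universe, so it should
    # report to the persistent daemon instead of starting its own HNP.
    os.system('orterun -np 2 --universe foo nwchem.x nwchem.inp > nwchem.out')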
>>
>> On 7/11/07 1:40 PM, "Lev Gelb" <g...@wuchem.wustl.edu> wrote:
>>
>>> OK, I've added the debug flags. When I add them to the os.system
>>> instance of orterun, there is no additional output, but when I add
>>> them to the orterun instance controlling the python program, I get
>>> the following:
>>>
>>>> orterun -np 1 --debug-daemons -mca odls_base_verbose 1 python ./test.py
>>> Daemon [0,0,1] checking in as pid 18054 on host druid.wustl.edu
>>> [druid.wustl.edu:18054] [0,0,1] orted: received launch callback
>>> [druid.wustl.edu:18054] odls: setting up launch for job 1
>>> [druid.wustl.edu:18054] odls: overriding oversubscription
>>> [druid.wustl.edu:18054] odls: oversubscribed set to false want_processor set to true
>>> [druid.wustl.edu:18054] odls: preparing to launch child [0, 1, 0]
>>> Pypar (version 1.9.3) initialised MPI OK with 1 processors
>>> [druid.wustl.edu:18057] OOB: Connection to HNP lost
>>> [druid.wustl.edu:18054] odls: child process terminated
>>> [druid.wustl.edu:18054] odls: child process [0,1,0] terminated normally
>>> [druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received message from [0,0,0]
>>> [druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received exit
>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: working on job -1
>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: checking child process [0,1,0]
>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: child is not alive
>>>
>>> (The Pypar output is from loading that module; the next thing in
>>> the code is the os.system call to start orterun with 2 processors.)
>>>
>>> Also, there is absolutely no output from the second orterun-launched
>>> program (even its first line does not execute).
>>>
>>> Cheers,
>>>
>>> Lev
>>>
>>>
>>>> Message: 5
>>>> Date: Wed, 11 Jul 2007 13:26:22 -0600
>>>> From: Ralph H Castain <r...@lanl.gov>
>>>> Subject: Re: [OMPI users] Recursive use of "orterun"
>>>> To: "Open MPI Users <us...@open-mpi.org>" <us...@open-mpi.org>
>>>> Message-ID: <c2ba8afe.9e64%...@lanl.gov>
>>>> Content-Type: text/plain; charset="US-ASCII"
>>>>
>>>> I'm unaware of any issues that would cause it to fail just because
>>>> it is being run via that interface.
>>>>
>>>> The error message is telling us that the procs got launched, but
>>>> then orterun went away unexpectedly. Are you seeing your procs
>>>> complete? We do sometimes see that message due to a race condition
>>>> between the daemons spawned to support the application procs and
>>>> orterun itself (see other recent notes in this forum).
>>>>
>>>> If your procs are not completing, then it would mean that either
>>>> the connecting fabric is failing for some reason, or orterun is
>>>> terminating early. If you could add "--debug-daemons -mca
>>>> odls_base_verbose 1" to the os.system command, the output from that
>>>> might help us understand why it is failing.
>>>>
>>>> Ralph
>>>>
>>>>
>>>> On 7/11/07 10:49 AM, "Lev Gelb" <g...@wuchem.wustl.edu> wrote:
>>>>
>>>>> Hi -
>>>>>
>>>>> I'm trying to port an application to use Open MPI, and I'm running
>>>>> into a problem. The program (written in Python, parallelized using
>>>>> either of "pypar" or "pyMPI") itself invokes "mpirun" in order to
>>>>> manage external, parallel processes, via something like:
>>>>>
>>>>> orterun -np 2 python myapp.py
>>>>>
>>>>> where myapp.py contains:
>>>>>
>>>>> os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')
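(For concreteness, a minimal version of the setup Lev describes might look like the following. This is a sketch only - the pypar calls are illustrative of a typical pypar program, not taken from Lev's actual code.)

    # myapp.py - minimal shape of the nested-orterun setup described above.
    # Launched with:  orterun -np 2 python myapp.py
    import os
    import pypar  # importing pypar initializes MPI (calls MPI_Init)

    if pypar.rank() == 0:
        # The nested launch: under Open MPI 1.2.3 this returns
        # immediately with "OOB: Connection to HNP lost".
        os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')

    pypar.barrier()   # illustrative synchronization point
    pypar.finalize()  # calls MPI_Finalize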
>>>>>
>>>>> I have this working under both LAM-MPI and MPICH on a variety of
>>>>> different machines. However, with Open MPI, all I get is an
>>>>> immediate return from the system call and the error:
>>>>>
>>>>> "OOB: Connection to HNP lost"
>>>>>
>>>>> I have verified that the command passed to os.system is correct,
>>>>> and even that it runs correctly if "myapp.py" doesn't invoke any
>>>>> MPI calls of its own.
>>>>>
>>>>> I'm testing Open MPI on a single box, so there's no machinefile
>>>>> involved. The system is running Fedora Core 6 x86-64, and I'm
>>>>> using the latest openmpi-1.2.3-1.src.rpm, rebuilt on the machine
>>>>> in question. I can provide additional configuration details if
>>>>> necessary.
>>>>>
>>>>> Thanks in advance for any help or advice,
>>>>>
>>>>> Lev
>>>>>
>>>>> ------------------------------------------------------------------
>>>>> Lev Gelb
>>>>> Associate Professor
>>>>> Department of Chemistry,
>>>>> Washington University in St. Louis,
>>>>> St. Louis, MO 63130 USA
>>>>>
>>>>> email: g...@wustl.edu
>>>>> phone: (314)935-5026
>>>>> fax: (314)935-4481
>>>>>
>>>>> http://www.chemistry.wustl.edu/~gelb
>>>>> ------------------------------------------------------------------

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users