Hmmm...interesting. As a cross-check: when you call os.system, does your environment get copied across to the child? Reason I ask: we set a number of environment variables when orterun spawns a process. If you call orterun from within that process - and the new orterun sees the environment variables inherited from the parent process - then I can guarantee it won't work.

What you need is for os.system to start its child with a clean environment. I would imagine that if you just os.system'd something that printed the environment, that would suffice to identify the problem. If you see anything that starts with OMPI_MCA_..., then we are indeed doomed - which would also explain why the persistent orted didn't help solve the problem.
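(A minimal sketch of that check from the Python side. The scrubbed-environment launch via subprocess is one possible workaround, not a verified fix, and the nwchem command line is carried over from Lev's original post below.)

    import os
    import subprocess

    # Quick check: does the os.system child inherit Open MPI's variables?
    # Any OMPI_MCA_* (or OMPI_* generally) in the child's environment
    # confirms the inheritance problem described above.
    os.system("env | grep '^OMPI_' || echo 'no OMPI_ variables inherited'")

    # One possible workaround: launch the inner orterun with every OMPI_*
    # variable scrubbed from the environment it inherits.
    clean_env = dict((k, v) for k, v in os.environ.items()
                     if not k.startswith('OMPI_'))
    subprocess.call('orterun -np 2 nwchem.x nwchem.inp > nwchem.out',
                    shell=True, env=clean_env)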
Ralph

On 7/11/07 3:05 PM, "Lev Gelb" <g...@wuchem.wustl.edu> wrote:

> Thanks for the suggestions. The separate 'orted' scheme (below) did
> not work, unfortunately; same behavior as before. I have conducted
> a few other simple tests, and found:
>
> 1. The problem only occurs if the first process is "in" MPI;
> if it doesn't call MPI_Init, or calls MPI_Finalize before it executes
> the second orterun, everything works.
>
> 2. Whether or not the second process actually uses MPI doesn't matter.
>
> 3. Using the standalone orted in "debug" mode with "universe"
> specified throughout, there does not appear to be any communication
> to orted upon the second invocation of orterun.
>
> (Also, I've been able to get nested orteruns working from simple
> shell scripts, but those don't call MPI_Init.)
>
> Cheers,
>
> Lev
>
>
> On Wed, 11 Jul 2007, Ralph H Castain wrote:
>
>> Hmmm...well, what that indicates is that your application program is
>> losing the connection to orterun, but that orterun is still alive and
>> kicking (it is alive enough to send the [0,0,1] daemon a message
>> ordering it to exit). So the question is: why is your application
>> program dropping the connection?
>>
>> I haven't tried doing embedded orterun commands, so there could be a
>> conflict there that causes the OOB connection to drop. My best guess
>> is that there is confusion over which orterun it is supposed to
>> connect to. I can give it a try and see - this may not be a mode we
>> can support.
>>
>> Alternatively, you could start a persistent daemon and then just
>> allow both orterun instances to report to it. Our method for doing
>> that isn't as convenient as we would like (we hope to improve it
>> soon), but it does work. What you have to do is:
>>
>> 1. To start the persistent daemon, type:
>>
>>    orted --seed --persistent --scope public --universe foo
>>
>> where foo can be whatever name you like.
>>
>> 2. When you execute your application, use:
>>
>>    orterun -np 1 --universe foo python ./test.py
>>
>> where the "foo" matches the name given above.
>>
>> 3. In your os.system command, you'll need that same "--universe foo"
>> option (a sketch of the full sequence follows this message).
>>
>> That may solve the problem (let me know if it does). Meantime, I'll
>> take a look at the embedded approach without the persistent
>> daemon...may take me awhile as I'm in the middle of something, but I
>> will try to get to it shortly.
>>
>> Ralph
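(Although Lev reports above that this scheme did not fix his problem, here is a minimal sketch of the sequence Ralph describes, for reference. The universe name "foo" and the nwchem command line are taken from the thread; placing "--universe foo" on the inner orterun follows step 3 but is otherwise unverified.)

    import os

    # Step 1 (run once, in a separate shell): start the persistent
    # daemon that both orterun instances will report to.
    #
    #   orted --seed --persistent --scope public --universe foo
    #
    # Step 2 (how this script itself is launched):
    #
    #   orterun -np 1 --universe foo python ./test.py

    # Step 3: the nested orterun names the same universe, so it should
    # report to the persistent daemon instead of starting its own HNP.
    os.system('orterun -np 2 --universe foo nwchem.x nwchem.inp > nwchem.out')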
>>
>> On 7/11/07 1:40 PM, "Lev Gelb" <g...@wuchem.wustl.edu> wrote:
>>
>>> OK, I've added the debug flags. When I add them to the os.system
>>> instance of orterun, there is no additional output, but when I add
>>> them to the orterun instance controlling the python program, I get
>>> the following:
>>>
>>>> orterun -np 1 --debug-daemons -mca odls_base_verbose 1 python ./test.py
>>> Daemon [0,0,1] checking in as pid 18054 on host druid.wustl.edu
>>> [druid.wustl.edu:18054] [0,0,1] orted: received launch callback
>>> [druid.wustl.edu:18054] odls: setting up launch for job 1
>>> [druid.wustl.edu:18054] odls: overriding oversubscription
>>> [druid.wustl.edu:18054] odls: oversubscribed set to false want_processor set to true
>>> [druid.wustl.edu:18054] odls: preparing to launch child [0, 1, 0]
>>> Pypar (version 1.9.3) initialised MPI OK with 1 processors
>>> [druid.wustl.edu:18057] OOB: Connection to HNP lost
>>> [druid.wustl.edu:18054] odls: child process terminated
>>> [druid.wustl.edu:18054] odls: child process [0,1,0] terminated normally
>>> [druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received message from [0,0,0]
>>> [druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received exit
>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: working on job -1
>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: checking child process [0,1,0]
>>> [druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: child is not alive
>>>
>>> (The Pypar output is from loading that module; the next thing in
>>> the code is the os.system call to start orterun with 2 processors.)
>>>
>>> Also, there is absolutely no output from the second orterun-launched
>>> program (even its first line does not execute).
>>>
>>> Cheers,
>>>
>>> Lev
>>>
>>>
>>>> Message: 5
>>>> Date: Wed, 11 Jul 2007 13:26:22 -0600
>>>> From: Ralph H Castain <r...@lanl.gov>
>>>> Subject: Re: [OMPI users] Recursive use of "orterun"
>>>> To: "Open MPI Users <us...@open-mpi.org>" <us...@open-mpi.org>
>>>> Message-ID: <c2ba8afe.9e64%...@lanl.gov>
>>>> Content-Type: text/plain; charset="US-ASCII"
>>>>
>>>> I'm unaware of any issues that would cause it to fail just because
>>>> it is being run via that interface.
>>>>
>>>> The error message is telling us that the procs got launched, but
>>>> then orterun went away unexpectedly. Are you seeing your procs
>>>> complete? We do sometimes see that message due to a race condition
>>>> between the daemons spawned to support the application procs and
>>>> orterun itself (see other recent notes in this forum).
>>>>
>>>> If your procs are not completing, then it would mean that either
>>>> the connecting fabric is failing for some reason, or orterun is
>>>> terminating early. If you could add "--debug-daemons -mca
>>>> odls_base_verbose 1" to the os.system command, the output from that
>>>> might help us understand why it is failing.
>>>>
>>>> Ralph
>>>>
>>>>
>>>> On 7/11/07 10:49 AM, "Lev Gelb" <g...@wuchem.wustl.edu> wrote:
>>>>
>>>>> Hi -
>>>>>
>>>>> I'm trying to port an application to use Open MPI, and I'm running
>>>>> into a problem. The program (written in Python, parallelized using
>>>>> either of "pypar" or "pyMPI") itself invokes "mpirun" in order to
>>>>> manage external, parallel processes, via something like:
>>>>>
>>>>> orterun -np 2 python myapp.py
>>>>>
>>>>> where myapp.py contains:
>>>>>
>>>>> os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')
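(For concreteness, a minimal version of the setup Lev describes might look like the following. This is a sketch only - the pypar calls are illustrative of a typical pypar program, not taken from Lev's actual code.)

    # myapp.py - minimal shape of the nested-orterun setup described above.
    # Launched with:  orterun -np 2 python myapp.py
    import os
    import pypar  # importing pypar initializes MPI (calls MPI_Init)

    if pypar.rank() == 0:
        # The nested launch: under Open MPI 1.2.3 this returns
        # immediately with "OOB: Connection to HNP lost".
        os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')

    pypar.barrier()   # illustrative synchronization point
    pypar.finalize()  # calls MPI_Finalize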
>>>>>
>>>>> I have this working under both LAM-MPI and MPICH on a variety of
>>>>> different machines. However, with Open MPI, all I get is an
>>>>> immediate return from the system call and the error:
>>>>>
>>>>> "OOB: Connection to HNP lost"
>>>>>
>>>>> I have verified that the command passed to os.system is correct,
>>>>> and even that it runs correctly if "myapp.py" doesn't invoke any
>>>>> MPI calls of its own.
>>>>>
>>>>> I'm testing Open MPI on a single box, so there's no machinefile
>>>>> involved. The system is running Fedora Core 6 x86-64, and I'm
>>>>> using the latest openmpi-1.2.3-1.src.rpm, rebuilt on the machine
>>>>> in question. I can provide additional configuration details if
>>>>> necessary.
>>>>>
>>>>> Thanks in advance for any help or advice,
>>>>>
>>>>> Lev
>>>>>
>>>>> ------------------------------------------------------------------
>>>>> Lev Gelb
>>>>> Associate Professor
>>>>> Department of Chemistry,
>>>>> Washington University in St. Louis,
>>>>> St. Louis, MO 63130 USA
>>>>>
>>>>> email: g...@wustl.edu
>>>>> phone: (314)935-5026
>>>>> fax: (314)935-4481
>>>>>
>>>>> http://www.chemistry.wustl.edu/~gelb
>>>>> ------------------------------------------------------------------

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users