Re: [OMPI users] Recursive use of "orterun" (Ralph H Castain)
Well done, that was exactly the problem - Python's os.environ passes along the complete collection of shell variables. I tried a different os method (os.execve), where I could specify the environment (I took out all the OMPI_* variables), and the second orterun call worked!

Now I just need a cleaner way to reset the environment within the spawned process. (Or, a way to tell orterun to ignore/overwrite the existing OMPI_* variables...?)

Thanks for your help,
Lev

On Wed, 11 Jul 2007, Ralph Castain wrote:

Hmmm...interesting. As a cross-check on something: when you call os.system, does your environment by any chance get copied across? The reason I ask is that we set a number of environment variables when orterun spawns a process. If you call orterun from within that process, and the new orterun sees the environment variables from the parent process, then I can guarantee it won't work. What you need is for os.system to start its child with a clean environment.

I would imagine that if you just os.system'd something that printed out the environment, that would suffice to identify the problem. If you see anything that starts with OMPI_MCA_..., then we are indeed doomed - which would also explain why the persistent orted didn't help solve the problem.

Ralph

On 7/11/07 3:05 PM, "Lev Gelb" <g...@wuchem.wustl.edu> wrote: [...]
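The fix Lev describes - launching the inner orterun with the OMPI_* variables removed - can be sketched in Python. This is a minimal sketch, not from the thread: the strip_ompi and run_nested helpers are illustrative names, and subprocess.call is used in place of the os.execve approach Lev tried, since it accepts an explicit environment without replacing the current process.

```python
import os
import subprocess

def strip_ompi(env):
    """Return a copy of env without the OMPI_* variables (including
    OMPI_MCA_*) that an outer orterun exports to its children."""
    return {k: v for k, v in env.items() if not k.startswith("OMPI_")}

def run_nested(cmd):
    # Unlike os.system, subprocess.call accepts an explicit environment,
    # and unlike os.execve it does not replace the current process.
    return subprocess.call(cmd, env=strip_ompi(os.environ))

# Example (requires Open MPI on the PATH):
# run_nested(["orterun", "-np", "2", "nwchem.x", "nwchem.inp"])
```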
Re: [OMPI users] Recursive use of "orterun" (Ralph H Castain)
Thanks for the suggestions. The separate 'orted' scheme (below) did not work, unfortunately; same behavior as before. I have conducted a few other simple tests and found:

1. The problem only occurs if the first process is "in" MPI; if it doesn't call MPI_Init, or calls MPI_Finalize before it executes the second orterun, everything works.

2. Whether or not the second process actually uses MPI doesn't matter.

3. Using the standalone orted in "debug" mode with "universe" specified throughout, there does not appear to be any communication to orted upon the second invocation of orterun.

(Also, I've been able to get nested orteruns working using simple shell scripts, but these don't call MPI_Init.)

Cheers,
Lev

On Wed, 11 Jul 2007, Ralph H Castain wrote:

Hmmm...well, what that indicates is that your application program is losing the connection to orterun, but that orterun is still alive and kicking (it is alive enough to send the [0,0,1] daemon a message ordering it to exit). So the question is: why is your application program dropping the connection?

I haven't tried doing embedded orterun commands, so there could be a conflict there that causes the OOB connection to drop. My best guess is that there is confusion over which orterun it is supposed to connect to. I can give it a try and see - this may not be a mode we can support.

Alternatively, you could start a persistent daemon and then allow both orterun instances to report to it. Our method for doing that isn't as convenient as we want it to be (we hope to improve it soon), but it does work. What you have to do is:

1. To start the persistent daemon, type:

   orted --seed --persistent --scope public --universe foo

   where "foo" can be whatever name you like.

2. When you execute your application, use:

   orterun -np 1 --universe foo python ./test.py

   where the "foo" matches the name given above.

3. In your os.system command, you'll need that same "--universe foo" option.

That may solve the problem (let me know if it does). Meantime, I'll take a look at the embedded approach without the persistent daemon... it may take me awhile as I'm in the middle of something, but I will try to get to it shortly.

Ralph

On 7/11/07 1:40 PM, "Lev Gelb" <g...@wuchem.wustl.edu> wrote: [...]
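Ralph's persistent-daemon workaround, collected into one runnable sequence for reference. The commands and the universe name "foo" are exactly the ones given in the thread; the nested os.system line is a sketch combining his step 3 with the original report's inner command, and assumes the Open MPI 1.2-era orted/orterun options he describes.

```shell
# 1. Start a persistent daemon under a named universe ("foo" is arbitrary).
orted --seed --persistent --scope public --universe foo

# 2. Launch the outer application against that same universe.
orterun -np 1 --universe foo python ./test.py

# 3. Inside test.py, the nested launch must name the same universe, e.g.:
#    os.system('orterun -np 2 --universe foo nwchem.x nwchem.inp > nwchem.out')
```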
Re: [OMPI users] Recursive use of "orterun" (Ralph H Castain)
OK, I've added the debug flags. When I add them to the os.system instance of orterun there is no additional output, but when I add them to the orterun instance controlling the Python program, I get the following:

orterun -np 1 --debug-daemons -mca odls_base_verbose 1 python ./test.py

Daemon [0,0,1] checking in as pid 18054 on host druid.wustl.edu
[druid.wustl.edu:18054] [0,0,1] orted: received launch callback
[druid.wustl.edu:18054] odls: setting up launch for job 1
[druid.wustl.edu:18054] odls: overriding oversubscription
[druid.wustl.edu:18054] odls: oversubscribed set to false want_processor set to true
[druid.wustl.edu:18054] odls: preparing to launch child [0, 1, 0]
Pypar (version 1.9.3) initialised MPI OK with 1 processors
[druid.wustl.edu:18057] OOB: Connection to HNP lost
[druid.wustl.edu:18054] odls: child process terminated
[druid.wustl.edu:18054] odls: child process [0,1,0] terminated normally
[druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received message from [0,0,0]
[druid.wustl.edu:18054] [0,0,1] orted_recv_pls: received exit
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: working on job -1
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: checking child process [0,1,0]
[druid.wustl.edu:18054] [0,0,1] odls_kill_local_proc: child is not alive

(The Pypar output is from loading that module; the next thing in the code is the os.system call to start orterun with 2 processors. Also, there is absolutely no output from the second orterun-launched program - even the first line does not execute.)

Cheers,
Lev

Message: 5
Date: Wed, 11 Jul 2007 13:26:22 -0600
From: Ralph H Castain <r...@lanl.gov>
Subject: Re: [OMPI users] Recursive use of "orterun"
To: "Open MPI Users <us...@open-mpi.org>" <us...@open-mpi.org>
Message-ID: <c2ba8afe.9e64%...@lanl.gov>
Content-Type: text/plain; charset="US-ASCII"

I'm unaware of any issues that would cause it to fail just because it is being run via that interface. The error message is telling us that the procs got launched, but then orterun went away unexpectedly. Are you seeing your procs complete? We do sometimes see that message due to a race condition between the daemons spawned to support the application procs and orterun itself (see other recent notes in this forum).

If your procs are not completing, then it would mean that either the connecting fabric is failing for some reason, or orterun is terminating early. If you could add "--debug-daemons -mca odls_base_verbose 1" to the os.system command, the output from that might help us understand why it is failing.

Ralph

On 7/11/07 10:49 AM, "Lev Gelb" <g...@wuchem.wustl.edu> wrote:

Hi - I'm trying to port an application to use OpenMPI, and running into a problem. The program (written in Python, parallelized using either of "pypar" or "pyMPI") itself invokes "mpirun" in order to manage external, parallel processes, via something like:

orterun -np 2 python myapp.py

where myapp.py contains:

os.system('orterun -np 2 nwchem.x nwchem.inp > nwchem.out')

I have this working under both LAM-MPI and MPICH on a variety of different machines. However, with OpenMPI, all I get is an immediate return from the system call and the error:

"OOB: Connection to HNP lost"

I have verified that the command passed to os.system is correct, and even that it runs correctly if myapp.py doesn't invoke any MPI calls of its own. I'm testing OpenMPI on a single box, so there's no machinefile-related configuration currently active. The system is running Fedora Core 6 x86-64, and I'm using the latest openmpi-1.2.3-1.src.rpm rebuilt on the machine in question; I can provide additional configuration details if necessary.

Thanks, in advance, for any help or advice,
Lev

--
Lev Gelb
Associate Professor
Department of Chemistry, Washington University in St. Louis
St. Louis, MO 63130 USA
email: g...@wustl.edu
phone: (314)935-5026
fax: (314)935-4481
http://www.chemistry.wustl.edu/~gelb

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
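For reference, the nested-launch pattern from the original report boils down to something like the sketch below. The inner_command helper is an illustrative name, not from the thread; the pypar import and the MPI-dependent lines are shown commented out so the sketch stands alone without an MPI stack installed.

```python
import os

# import pypar  # importing pypar calls MPI_Init (the thread's setup)

def inner_command(np=2):
    """Build the nested launch line from the original report. Under Open
    MPI, running this via os.system passes the parent's OMPI_* variables
    to the inner orterun, which is what made the nested launch fail."""
    return "orterun -np %d nwchem.x nwchem.inp > nwchem.out" % np

# os.system(inner_command())  # fails when run under an outer orterun
# pypar.finalize()
```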