Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc
All,

It looks like the issue is solved. Our sysadmin had been working on the issue too - he noticed a lot of "junk" in my /etc/ld.so.conf.d/ directory. After "cleaning" it out (I think he ended up wiping everything out, then rebooting the machine, then re-configuring specific items as needed), my OpenMPI installation is working fine. I can now run "mpirun -np # hello_c" where # is any integer. The same holds true for our specialized applications (Gemini, Salinas, etc.).

Apologies - I don't know why "cleaning" this directory fixed things. I'm also not sure why OpenMPI stopped working in the first place. The timing seems to coincide with two updates to my machine: the kernel and, subsequently, the Nvidia driver were both updated right before "mpirun" stopped working correctly. The sysadmin mentioned it could be related to ldconfig. Again, I don't know why this would cause "mpirun" to misbehave. However, everything appears to work correctly now.

Thank you for your help, and hopefully this thread proves useful to someone in the future.

--
Jon Stergiou
Engineer
NSWC Carderock

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, April 12, 2011 11:38
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc

Okay, that says that mpirun is working correctly - the problem appears to be in MPI_Init. How was OMPI configured?

On Apr 12, 2011, at 9:24 AM, Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640 wrote:

> Ralph,
>
> Thanks for the reply and guidance.
>
> I ran the following:
>
> $> mpirun -np 1 hostname
> XXX_TUX01
>
> $> mpirun -np 2 hostname
> XXX_TUX01
> XXX_TUX01
>
> $> mpirun -np 1 ./hello_c
> Hello, world, I am 0 of 1.
>
> $> mpirun -np 2 ./hello_c
> (no result, terminal does not respond until ctrl+c)
>
>> Let's simplify the issue as we have no idea what your codes are doing.
>>
>> Can you run two copies of hostname, for example?
>>
>> What about multiple copies of an MPI version of "hello" - see the examples directory in the OMPI tarball.
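For readers who land on this thread with the same symptom: /etc/ld.so.conf.d/ feeds the dynamic linker cache, so stale or conflicting fragments there (for example, one dropped in by a driver update) can change which shared libraries mpirun and the MPI processes actually load. A minimal sketch of how one might inspect and rebuild that cache is below; the grep pattern uses the usual Open MPI library names plus driver-related keywords and is only illustrative, not taken from this thread:

[code]
# List the loader configuration fragments that ldconfig reads
cat /etc/ld.so.conf /etc/ld.so.conf.d/*.conf

# Show what is currently in the cache; filter for MPI and driver-related entries
ldconfig -p | grep -i -E 'libmpi|libopen-rte|libopen-pal|nvidia|cuda'

# After removing stale fragments from /etc/ld.so.conf.d/, rebuild the cache (as root)
sudo ldconfig -v
[/code]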
Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc
Okay, that says that mpirun is working correctly - the problem appears to be in MPI_Init. How was OMPI configured?

On Apr 12, 2011, at 9:24 AM, Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640 wrote:

> Ralph,
>
> Thanks for the reply and guidance.
>
> I ran the following:
>
> $> mpirun -np 1 hostname
> XXX_TUX01
>
> $> mpirun -np 2 hostname
> XXX_TUX01
> XXX_TUX01
>
> $> mpirun -np 1 ./hello_c
> Hello, world, I am 0 of 1.
>
> $> mpirun -np 2 ./hello_c
> (no result, terminal does not respond until ctrl+c)
>
>> Let's simplify the issue as we have no idea what your codes are doing.
>>
>> Can you run two copies of hostname, for example?
>>
>> What about multiple copies of an MPI version of "hello" - see the examples directory in the OMPI tarball.
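As an aside, one way to answer the "how was OMPI configured?" question without the original build tree is ompi_info, which reports the installation's build settings. The exact field names vary across Open MPI versions, so the grep below is only a guess at useful keywords:

[code]
# Dump the build/configuration report and keep the lines that usually identify
# the install: version, install prefix, compilers, and configure-time options
ompi_info --all | grep -i -E 'version|prefix|configur|compiler'
[/code]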
Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc
Ralph,

Thanks for the reply and guidance.

I ran the following:

$> mpirun -np 1 hostname
XXX_TUX01

$> mpirun -np 2 hostname
XXX_TUX01
XXX_TUX01

$> mpirun -np 1 ./hello_c
Hello, world, I am 0 of 1.

$> mpirun -np 2 ./hello_c
(no result, terminal does not respond until ctrl+c)

> Let's simplify the issue as we have no idea what your codes are doing.
>
> Can you run two copies of hostname, for example?
>
> What about multiple copies of an MPI version of "hello" - see the examples directory in the OMPI tarball.
Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc
Let's simplify the issue as we have no idea what your codes are doing.

Can you run two copies of hostname, for example?

What about multiple copies of an MPI version of "hello" - see the examples directory in the OMPI tarball.

On Apr 12, 2011, at 8:43 AM, Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640 wrote:

> Apologies for not clarifying. The behavior below is expected; I am just checking to see that Gemini will start up and look for its input file. When Gemini+OpenMPI is working correctly, I expect to see the behavior below.
>
> When Gemini+OpenMPI is not working correctly (current behavior), I see the second behavior. When running with "-np 1", Gemini will start up and look for its input file. When running with "-np 2" (or anything more than 1), Gemini never starts up. Instead, the code simply hangs indefinitely. I showed Gemini as an example. I don't believe the issue is Gemini-related, as I've reproduced the same "hanging" behavior with two other MPI codes (Salinas, ParaDyn).
>
> The same codebase runs correctly on many other workstations (transferred from my machine, the build machine, to a colleague's machine via "rsync -vrlpu /opt/sierra/ targetmachine:/opt/sierra").
>
> I tried the following fixes, but still have problems:
>
> - Copy salinas (or geminimpi) locally and run "mpirun -np 2 ./salinas".
>   Tried running locally, both interactively and through the queueing system. No difference in behavior.
>
> - Compare "ldd salinas" and "ldd gemini" with functioning examples (examples from coworkers' workstations).
>   Compared "ldd salinas" output (and "ldd geminimpi") with results from other workstations. Comparisons look fine.
>
> - Create a new user account with a clean profile on my workstation, in case it is an environment problem.
>   Created a new user account and sourced "/opt/sierra/install/sierra_init.sh" to set up the path. No difference in behavior.
>
> - Compare /etc/profile and /etc/bashrc with "functioning" examples.
>   Compared my /etc/profile and /etc/bashrc with colleagues'. Comparisons don't raise any flags.
>
> I can provide other diagnostic-type information as requested.
>
> --
> Jon Stergiou
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640
> Sent: Monday, April 11, 2011 9:53
> To: us...@open-mpi.org
> Subject: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc
>
> I am running OpenMPI 1.4.2 under RHEL 5.5. After install, I tested with "mpirun -np 4 date"; the command returned four "date" outputs.
>
> Then I tried running two different MPI programs, "geminimpi" and "salinas". Both run correctly with "mpirun -np 1 $prog". However, both hang indefinitely when I use anything other than "-np 1".
>
> Next, I ran "mpirun --debug-daemons -np 1 geminimpi" and got the following (this looks good, and is what I would expect):
>
> [code]
> [xxx@XXX_TUX01 ~]$ mpirun --debug-daemons -np 1 geminimpi
> [XXX_TUX01:06558] [[15027,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received add_local_procs
> [XXX_TUX01:06558] [[15027,0],0] orted_recv: received sync+nidmap from local proc [[15027,1],0]
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
> Fluid Proc Ready: ID, FluidMaster,LagMaster = 001
> Checking license for Gemini
> Checking license for Linux OS
> Checking internal license list
> License valid
>
> GEMINI Startup
> Gemini +++ Version 5.1.00 20110501 +++
>
> + ERROR MESSAGE +
> FILE MISSING (Input): name = gemini.inp
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received waitpid_fired cmd
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received iof_complete cmd
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 6559 on
> node XXX_TUX01 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received exit
> [/code]
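The "hello" example Ralph refers to is examples/hello_c.c in the Open MPI source tarball. Assuming the 1.4.2 tarball is unpacked under the home directory (the path below is only illustrative), building and running it would look roughly like this:

[code]
# Build the MPI "hello world" shipped with the Open MPI source (path assumed)
cd ~/openmpi-1.4.2/examples
mpicc hello_c.c -o hello_c

# One rank works in this thread; two ranks on the same node is the case that hangs
mpirun -np 1 ./hello_c
mpirun -np 2 ./hello_c
[/code]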
Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc
Apologies for not clarifying. The behavior below is expected; I am just checking to see that Gemini will start up and look for its input file. When Gemini+OpenMPI is working correctly, I expect to see the behavior below.

When Gemini+OpenMPI is not working correctly (current behavior), I see the second behavior. When running with "-np 1", Gemini will start up and look for its input file. When running with "-np 2" (or anything more than 1), Gemini never starts up. Instead, the code simply hangs indefinitely. I showed Gemini as an example. I don't believe the issue is Gemini-related, as I've reproduced the same "hanging" behavior with two other MPI codes (Salinas, ParaDyn).

The same codebase runs correctly on many other workstations (transferred from my machine, the build machine, to a colleague's machine via "rsync -vrlpu /opt/sierra/ targetmachine:/opt/sierra").

I tried the following fixes, but still have problems:

- Copy salinas (or geminimpi) locally and run "mpirun -np 2 ./salinas".
  Tried running locally, both interactively and through the queueing system. No difference in behavior.

- Compare "ldd salinas" and "ldd gemini" with functioning examples (examples from coworkers' workstations); a sketch of this comparison follows after this message.
  Compared "ldd salinas" output (and "ldd geminimpi") with results from other workstations. Comparisons look fine.

- Create a new user account with a clean profile on my workstation, in case it is an environment problem.
  Created a new user account and sourced "/opt/sierra/install/sierra_init.sh" to set up the path. No difference in behavior.

- Compare /etc/profile and /etc/bashrc with "functioning" examples.
  Compared my /etc/profile and /etc/bashrc with colleagues'. Comparisons don't raise any flags.

I can provide other diagnostic-type information as requested.

--
Jon Stergiou

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640
Sent: Monday, April 11, 2011 9:53
To: us...@open-mpi.org
Subject: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc

I am running OpenMPI 1.4.2 under RHEL 5.5. After install, I tested with "mpirun -np 4 date"; the command returned four "date" outputs.

Then I tried running two different MPI programs, "geminimpi" and "salinas". Both run correctly with "mpirun -np 1 $prog". However, both hang indefinitely when I use anything other than "-np 1".

Next, I ran "mpirun --debug-daemons -np 1 geminimpi" and got the following (this looks good, and is what I would expect):

[code]
[xxx@XXX_TUX01 ~]$ mpirun --debug-daemons -np 1 geminimpi
[XXX_TUX01:06558] [[15027,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_recv: received sync+nidmap from local proc [[15027,1],0]
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
Fluid Proc Ready: ID, FluidMaster,LagMaster = 001
Checking license for Gemini
Checking license for Linux OS
Checking internal license list
License valid

GEMINI Startup
Gemini +++ Version 5.1.00 20110501 +++

+ ERROR MESSAGE +
FILE MISSING (Input): name = gemini.inp
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received waitpid_fired cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received iof_complete cmd
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 6559 on
node XXX_TUX01 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received exit
[/code]

With "mpirun --debug-daemons -np 2 geminimpi", it hangs indefinitely like this:

[code]
[xxx@XXX_TUX01 ~]$ mpirun --debug-daemons -np 2 geminimpi
[XXX_TUX01:06570] [[14983,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],1]
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],0]
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received message_local_procs
[/code]

I cloned my entire installation to a number of other machines to test. On all the other workstations, everything behaves correctly and various regression suites return good results. Any ideas?
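For reference, the ldd comparison step described above can be made mechanical along these lines; the hostname and binary path are placeholders, and it assumes ssh access to a workstation where the codes behave correctly:

[code]
# Record the shared libraries the binary resolves on this (misbehaving) machine
ldd /opt/sierra/bin/salinas | sort > /tmp/ldd_local.txt

# Capture the same listing from a known-good workstation and diff the two;
# differences in libmpi/libopen-rte/libopen-pal paths would point at the loader setup
ssh goodhost 'ldd /opt/sierra/bin/salinas | sort' > /tmp/ldd_good.txt
diff /tmp/ldd_local.txt /tmp/ldd_good.txt
[/code]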
Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc
On Apr 11, 2011, at 9:53 AM, Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640 wrote:

> + ERROR MESSAGE +
> FILE MISSING (Input): name = gemini.inp

This seems like a gemini error, not an Open MPI error.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc
I am running OpenMPI 1.4.2 under RHEL 5.5. After install, I tested with "mpirun -np 4 date"; the command returned four "date" outputs.

Then I tried running two different MPI programs, "geminimpi" and "salinas". Both run correctly with "mpirun -np 1 $prog". However, both hang indefinitely when I use anything other than "-np 1".

Next, I ran "mpirun --debug-daemons -np 1 geminimpi" and got the following (this looks good, and is what I would expect):

[code]
[xxx@XXX_TUX01 ~]$ mpirun --debug-daemons -np 1 geminimpi
[XXX_TUX01:06558] [[15027,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_recv: received sync+nidmap from local proc [[15027,1],0]
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
Fluid Proc Ready: ID, FluidMaster,LagMaster = 001
Checking license for Gemini
Checking license for Linux OS
Checking internal license list
License valid

GEMINI Startup
Gemini +++ Version 5.1.00 20110501 +++

+ ERROR MESSAGE +
FILE MISSING (Input): name = gemini.inp
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received waitpid_fired cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received iof_complete cmd
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 6559 on
node XXX_TUX01 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received exit
[/code]

With "mpirun --debug-daemons -np 2 geminimpi", it hangs indefinitely like this:

[code]
[xxx@XXX_TUX01 ~]$ mpirun --debug-daemons -np 2 geminimpi
[XXX_TUX01:06570] [[14983,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],1]
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],0]
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received message_local_procs
[/code]

I cloned my entire installation to a number of other machines to test. On all the other workstations, everything behaves correctly and various regression suites return good results.

Any ideas?

--
Jon Stergiou
Engineer
NSWC Carderock