Apologies for not clarifying. The first behavior below is expected; I am just checking that Gemini will start up and look for its input file. When Gemini+OpenMPI is working correctly, that is what I expect to see.
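In case it helps, a bare-bones MPI program along the lines of the sketch below checks the same basic thing (that MPI_Init and a trivial collective complete across ranks) without involving our production codes at all. This is only a generic sketch -- the file name, the prints, and the comments are arbitrary, nothing Gemini- or Salinas-specific:

[code]
/* mpi_check.c -- minimal MPI sanity check (generic sketch, not part of
 * Gemini or Salinas). Each rank reports its rank, the total size, and
 * the host name, then all ranks synchronize and exit. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);            /* this is where a broken launch would stall */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &name_len);

    printf("rank %d of %d on %s\n", rank, size, name);

    MPI_Barrier(MPI_COMM_WORLD);       /* trivial collective across all ranks */
    MPI_Finalize();
    return 0;
}
[/code]

Built with "mpicc mpi_check.c -o mpi_check" and run with "mpirun -np 2 ./mpi_check", a healthy install should print one line per rank and exit; if the problem is in the MPI layer rather than in the applications, I would expect this to hang the same way the production codes do.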
When Gemini+OpenMPI is not working correctly (the current behavior), I see the second behavior below. When running with "-np 1", Gemini starts up and looks for its input file. When running with "-np 2" (or anything more than 1), Gemini never starts up; the run simply hangs indefinitely.

I showed Gemini only as an example. I don't believe the issue is Gemini-related, as I've reproduced the same hanging behavior with two other MPI codes (Salinas, ParaDyn). The same codebase runs correctly on many other workstations (transferred from my machine, the build machine, to colleagues' machines via "rsync -vrlpu /opt/sierra/ targetmachine:/opt/sierra").

I tried the following fixes, but still have the problem:

- Copied salinas (or geminimpi) locally and ran "mpirun -np 2 ./salinas", both interactively and through the queueing system. No difference in behavior.
- Compared "ldd salinas" and "ldd geminimpi" output against functioning examples from coworkers' workstations. The comparisons look fine.
- Created a new user account with a clean profile on my workstation, in case it is an environment problem, and sourced "/opt/sierra/install/sierra_init.sh" to set up the path. No difference in behavior.
- Compared my /etc/profile and /etc/bashrc with colleagues' "functioning" examples. The comparisons don't raise any flags.

I can provide other diagnostic-type information as requested.

--
Jon Stergiou

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640
Sent: Monday, April 11, 2011 9:53
To: us...@open-mpi.org
Subject: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc

I am running OpenMPI 1.4.2 under RHEL 5.5. After the install, I tested with "mpirun -np 4 date"; the command returned four "date" outputs. Then I tried running two different MPI programs, "geminimpi" and "salinas". Both run correctly with "mpirun -np 1 $prog". However, both hang indefinitely when I use anything other than "-np 1".

Next, I ran "mpirun --debug-daemons -np 1 geminimpi" and got the following (this looks good and is what I would expect):

[code]
[xxx@XXX_TUX01 ~]$ mpirun --debug-daemons -np 1 geminimpi
[XXX_TUX01:06558] [[15027,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_recv: received sync+nidmap from local proc [[15027,1],0]
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
Fluid Proc Ready: ID, FluidMaster,LagMaster = 0 0 1
Checking license for Gemini
Checking license for Linux OS
Checking internal license list
License valid
GEMINI Startup
Gemini +++ Version 5.1.00 20110501 +++
+++++ ERROR MESSAGE +++++
FILE MISSING (Input): name = gemini.inp
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received waitpid_fired cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received iof_complete cmd
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 6559 on node XXX_TUX01 exiting without calling "finalize".
This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received exit
[/code]

With "mpirun --debug-daemons -np 2 geminimpi", it hangs indefinitely like this:

[code]
[xxx@XXX_TUX01 ~]$ mpirun --debug-daemons -np 2 geminimpi
[XXX_TUX01:06570] [[14983,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],1]
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],0]
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received message_local_procs
[/code]

I cloned my entire installation to a number of other machines to test. On all of those workstations everything behaves correctly and the various regression suites return good results.

Any ideas?

--
Jon Stergiou
Engineer
NSWC Carderock

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
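P.S. If it would help the diagnosis, I can also run a minimal two-rank send/receive test along the lines of the sketch below (again generic MPI, nothing Gemini- or Salinas-specific; the token value and prints are arbitrary) and post its output together with the --debug-daemons log:

[code]
/* mpi_pingpong.c -- generic two-rank send/receive check (sketch only).
 * Rank 0 sends a token to rank 1, which sends it back; both report success. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0)
            fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0: round trip complete, token = %d\n", token);
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        printf("rank 1: echoed token %d back to rank 0\n", token);
    }

    MPI_Finalize();
    return 0;
}
[/code]

If a two-rank run of something like this also stalls before printing anything, that would point at MPI startup/communication on this workstation rather than at the applications.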