Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc

2011-04-13 Thread Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640
All,

It looks like the issue is resolved.  Our sysadmin had been working on the 
problem as well - he noticed a lot of "junk" in my /etc/ld.so.conf.d/ directory.  
After "cleaning" it out (I believe he wiped everything, rebooted the machine, 
and then re-configured specific items as needed), my OpenMPI installation is 
working fine. 

I can now run "mpirun -np # hello_c", where # is any integer.  The same holds 
true for our specialized applications (Gemini, Salinas, etc.). 

Apologies - I don't know why "cleaning" this directory fixed things.  I'm also 
not sure why OpenMPI stopped working in the first place.  The timing seems to 
coincide with two updates to my machine: the kernel, and subsequently the 
Nvidia driver, were both updated right before "mpirun" stopped working 
correctly. 

The sysadmin mentioned it could be related to ldconfig.  Again, I don't know 
why this would cause "mpirun" to misbehave.  However, everything appears to 
work correctly now. 
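
[Added sketch - not part of the original message.  For anyone retracing this, 
the generic commands below are one way to check whether a stale entry under 
/etc/ld.so.conf.d/ is steering the dynamic linker to the wrong MPI libraries; 
exact file names will differ per system.]

[code]
$ ls /etc/ld.so.conf.d/                  # look for stale or conflicting *.conf entries
$ ldconfig -p | grep -i libmpi           # which libmpi.so the linker cache resolves to
$ ldd $(which mpirun) | grep -i mpi      # which MPI libraries mpirun actually loads
$ sudo ldconfig                          # rebuild the cache after removing bad entries
[/code]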

Thank you for your help, and hopefully this thread proves useful to someone in 
the future. 

--
Jon Stergiou
Engineer
NSWC Carderock


-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Tuesday, April 12, 2011 11:38
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc

Okay, that says that mpirun is working correctly - the problem appears to be in 
MPI_Init.

How was OMPI configured?


On Apr 12, 2011, at 9:24 AM, Stergiou, Jonathan C CIV NSWCCD West Bethesda, 
6640 wrote:

> Ralph,
> 
> Thanks for the reply and guidance. 
> 
> I ran the following:
> 
> $> mpirun -np 1 hostname
> XXX_TUX01
> 
> $> mpirun -np 2 hostname
> XXX_TUX01
> XXX_TUX01
> 
> $> mpirun -np 1 ./hello_c
> Hello, world, I am 0 of 1. 
> 
> $> mpirun -np 2 ./hello_c
> (no result, terminal does not respond until ctrl+c)
> 
> 
> 
>> Let's simplify the issue as we have no idea what your codes are doing. 
>> 
>> Can you run two copies of hostname, for example? 
>> 
>> What about multiple copies of an MPI version of "hello" - see the examples 
>> directory in the OMPI tarball. 
> 
> 




Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc

2011-04-12 Thread Ralph Castain
Okay, that says that mpirun is working correctly - the problem appears to be in 
MPI_Init.

How was OMPI configured?
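
[Added sketch - not part of Ralph's message.  The configure details of an 
installed copy can usually be recovered with ompi_info; the exact output 
fields vary by Open MPI version.]

[code]
$ ompi_info | grep -i configure    # "Configured by" / "Configure host" / configure date
$ ompi_info --all                  # full build details and MCA parameters
[/code]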


On Apr 12, 2011, at 9:24 AM, Stergiou, Jonathan C CIV NSWCCD West Bethesda, 
6640 wrote:

> Ralph,
> 
> Thanks for the reply and guidance. 
> 
> I ran the following:
> 
> $> mpirun -np 1 hostname
> XXX_TUX01
> 
> $> mpirun -np 2 hostname
> XXX_TUX01
> XXX_TUX01
> 
> $> mpirun -np 1 ./hello_c
> Hello, world, I am 0 of 1. 
> 
> $> mpirun -np 2 ./hello_c
> (no result, terminal does not respond until ctrl+c)
> 
> 
> 
>> Let's simplify the issue as we have no idea what your codes are doing. 
>> 
>> Can you run two copies of hostname, for example? 
>> 
>> What about multiple copies of an MPI version of "hello" - see the examples 
>> directory in the OMPI tarball. 
> 
> 




Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc

2011-04-12 Thread Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640
Ralph,

Thanks for the reply and guidance. 

I ran the following:

$> mpirun -np 1 hostname
XXX_TUX01

$> mpirun -np 2 hostname
XXX_TUX01
XXX_TUX01

$> mpirun -np 1 ./hello_c
Hello, world, I am 0 of 1. 

$> mpirun -np 2 ./hello_c
(no result, terminal does not respond until ctrl+c)
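
[Added sketch - not part of the original exchange.  When MPI_Init hangs like 
this on a single node, attaching a debugger to one of the stuck ranks from a 
second terminal usually shows where it is blocked; this assumes gdb is 
installed and hello_c is the hung binary.]

[code]
$ pgrep hello_c                          # PIDs of the hung ranks
$ gdb -p $(pgrep hello_c | head -n 1)    # attach to the first one
(gdb) bt                                 # backtrace typically ends inside MPI_Init
(gdb) detach
(gdb) quit
[/code]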



> Let's simplify the issue as we have no idea what your codes are doing. 
> 
> Can you run two copies of hostname, for example? 
> 
> What about multiple copies of an MPI version of "hello" - see the examples 
> directory in the OMPI tarball. 






Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc

2011-04-12 Thread Ralph Castain
Let's simplify the issue as we have no idea what your codes are doing.

Can you run two copies of hostname, for example?

What about multiple copies of an MPI version of "hello" - see the examples 
directory in the OMPI tarball.
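
[Added sketch - not part of Ralph's message.  Concretely, assuming an 
openmpi-1.4.2 source tarball is at hand, the bundled examples can be built and 
run like this:]

[code]
$ tar xzf openmpi-1.4.2.tar.gz
$ cd openmpi-1.4.2/examples
$ make                     # builds hello_c, ring_c, etc. with the mpicc found in PATH
$ mpirun -np 2 ./hello_c
[/code]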


On Apr 12, 2011, at 8:43 AM, Stergiou, Jonathan C CIV NSWCCD West Bethesda, 
6640 wrote:

> Apologies for not clarifying.  The behavior below is expected; I am just 
> checking that Gemini will start up and look for its input file.  When 
> Gemini+OpenMPI is working correctly, this is what I expect to see. 
> 
> When Gemini+OpenMPI is not working correctly (the current behavior), I see the 
> second behavior instead.  When running with "-np 1", Gemini starts up and 
> looks for its input file.  When running with "-np 2" (or anything more than 
> 1), Gemini never starts up; the code simply hangs indefinitely.  I showed 
> Gemini only as an example.  I don't believe the issue is Gemini-related, as 
> I've reproduced the same "hanging" behavior with two other MPI codes 
> (Salinas, ParaDyn). 
> 
> The same codebase runs correctly on many other workstations; it was 
> transferred from my build machine to each colleague's machine via 
> "rsync -vrlpu /opt/sierra/ targetmachine:/opt/sierra". 
> 
> I tried the following fixes, but still have problems: 
> 
> -Copy salinas (or geminimpi) locally, run "mpirun -np 2 ./salinas"
> Tried running locally, both interactively and through queueing system.  No 
> difference in behavior. 
> 
> -Compare "ldd salinas" and "ldd gemini" with functioning examples (examples 
> from coworkers' workstations). 
> Compared "ldd salinas" output (and "ldd geminimpi") with results from other 
> workstations.  Comparisons look fine. 
> 
> -Create new user account with clean profile on my workstation.  Maybe it is 
> an environment problem. 
> Created new user account and sourced "/opt/sierra/install/sierra_init.sh" to 
> set up path.  No difference in behavior. 
> 
> -Compare /etc/profile and /etc/bashrc with "functioning" examples. 
> I compared my /etc/profile and /etc/bashrc with colleagues.  Comparisons 
> don't raise any flags. 
> 
> I can provide other diagnostic-type information as requested. 
> 
> --
> Jon Stergiou
> 
> 
> 
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640
> Sent: Monday, April 11, 2011 9:53
> To: us...@open-mpi.org
> Subject: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc
> 
> I am running OpenMPI 1.4.2 under RHEL 5.5.  After install, I tested with 
> "mpirun -np 4 date"; the command returned four "date" outputs. 
> 
> Then I tried running two different MPI programs, "geminimpi" and "salinas".  
> Both run correctly with "mpirun -np 1 $prog".  However, both hang 
> indefinitely when I use anything other than "-np 1".  
> 
> Next, I ran "mpirun --debug-daemons -np 1 geminimpi" and got the following:  
> (this looks good, and is what I would expect)
> 
> [code]
> [xxx@XXX_TUX01 ~]$ mpirun --debug-daemons -np 1 geminimpi
> [XXX_TUX01:06558] [[15027,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received add_local_procs
> [XXX_TUX01:06558] [[15027,0],0] orted_recv: received sync+nidmap from local 
> proc [[15027,1],0]
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
> Fluid Proc Ready: ID, FluidMaster,LagMaster = 001
> Checking license for Gemini
> Checking license for Linux OS
> Checking internal license list
> License valid
> 
> GEMINI Startup
> Gemini +++ Version 5.1.00  20110501 +++
> 
> + ERROR MESSAGE +
> FILE MISSING (Input): name = gemini.inp
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received waitpid_fired cmd
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received iof_complete cmd
> --
> mpirun has exited due to process rank 0 with PID 6559 on
> node XXX_TUX01 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received exit
> [/code]

Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc

2011-04-12 Thread Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640
Apologies for not clarifying.  The behavior below is expected; I am just 
checking that Gemini will start up and look for its input file.  When 
Gemini+OpenMPI is working correctly, this is what I expect to see. 

When Gemini+OpenMPI is not working correctly (the current behavior), I see the 
second behavior instead.  When running with "-np 1", Gemini starts up and looks 
for its input file.  When running with "-np 2" (or anything more than 1), 
Gemini never starts up; the code simply hangs indefinitely.  I showed Gemini 
only as an example.  I don't believe the issue is Gemini-related, as I've 
reproduced the same "hanging" behavior with two other MPI codes (Salinas, 
ParaDyn). 

The same codebase runs correctly on many other workstations; it was transferred 
from my build machine to each colleague's machine via "rsync -vrlpu 
/opt/sierra/ targetmachine:/opt/sierra". 

I tried the following fixes, but still have problems: 

-Copy salinas (or geminimpi) locally, run "mpirun -np 2 ./salinas"
Tried running locally, both interactively and through queueing system.  No 
difference in behavior. 

-Compare "ldd salinas" and "ldd gemini" with functioning examples (examples 
from coworkers' workstations). 
Compared "ldd salinas" output (and "ldd geminimpi") with results from other 
workstations.  Comparisons look fine. 

-Create new user account with clean profile on my workstation.  Maybe it is an 
environment problem. 
Created new user account and sourced "/opt/sierra/install/sierra_init.sh" to 
set up path.  No difference in behavior. 

-Compare /etc/profile and /etc/bashrc with "functioning" examples. 
I compared my /etc/profile and /etc/bashrc with colleagues.  Comparisons don't 
raise any flags. 

I can provide other diagnostic-type information as requested. 
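
[Added sketch - not part of the original message.  One concrete way to diff 
library resolution and environment against a known-good workstation; 
"goodhost" and the path to the salinas binary are placeholders.]

[code]
$ cd /opt/sierra
$ ldd ./salinas | awk '{print $1, $3}' | sort > ldd.$(hostname).txt   # drop per-run load addresses
$ env | sort > env.$(hostname).txt
# Generate the same two files on the known-good workstation, copy them back, then:
$ diff ldd.$(hostname).txt ldd.goodhost.txt
$ diff env.$(hostname).txt env.goodhost.txt
[/code]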

--
Jon Stergiou



-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640
Sent: Monday, April 11, 2011 9:53
To: us...@open-mpi.org
Subject: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc

I am running OpenMPI 1.4.2 under RHEL 5.5.  After install, I tested with 
"mpirun -np 4 date"; the command returned four "date" outputs. 

Then I tried running two different MPI programs, "geminimpi" and "salinas".  
Both run correctly with "mpirun -np 1 $prog".  However, both hang indefinitely 
when I use anything other than "-np 1".  

Next, I ran "mpirun --debug-daemons -np 1 geminimpi" and got the following:  
(this looks good, and is what I would expect)

[code]
[xxx@XXX_TUX01 ~]$ mpirun --debug-daemons -np 1 geminimpi
[XXX_TUX01:06558] [[15027,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_recv: received sync+nidmap from local 
proc [[15027,1],0]
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
Fluid Proc Ready: ID, FluidMaster,LagMaster = 001
 Checking license for Gemini
 Checking license for Linux OS
 Checking internal license list
 License valid

 GEMINI Startup
 Gemini +++ Version 5.1.00  20110501 +++

 + ERROR MESSAGE +
 FILE MISSING (Input): name = gemini.inp
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received waitpid_fired cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received iof_complete cmd
--
mpirun has exited due to process rank 0 with PID 6559 on
node XXX_TUX01 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received exit
[/code]

With "mpirun --debug-daemons -np 2 geminimpi", it hangs like this: (hangs 
indefinitely)

[code]
[xxx@XXX_TUX01 ~]$ mpirun --debug-daemons -np 2 geminimpi
[XXX_TUX01:06570] [[14983,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local 
proc [[14983,1],1]
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local 
proc [[14983,1],0]
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received message_local_procs
[/code]


I cloned my entire installation to a number of other machines to test.  On all 
the other workstations, everything behaves correctly and various regression 
suites return good results. 

Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc

2011-04-12 Thread Jeff Squyres
On Apr 11, 2011, at 9:53 AM, Stergiou, Jonathan C CIV NSWCCD West Bethesda, 
6640 wrote:

> + ERROR MESSAGE +
> FILE MISSING (Input): name = gemini.inp

This seems like a gemini error, not an Open MPI error.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc

2011-04-11 Thread Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640
I am running OpenMPI 1.4.2 under RHEL 5.5.  After install, I tested with 
"mpirun -np 4 date"; the command returned four "date" outputs. 

Then I tried running two different MPI programs, "geminimpi" and "salinas".  
Both run correctly with "mpirun -np 1 $prog".  However, both hang indefinitely 
when I use anything other than "-np 1".  

Next, I ran "mpirun --debug-daemons -np 1 geminimpi" and got the following:  
(this looks good, and is what I would expect)

[code]
[xxx@XXX_TUX01 ~]$ mpirun --debug-daemons -np 1 geminimpi
[XXX_TUX01:06558] [[15027,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_recv: received sync+nidmap from local 
proc [[15027,1],0]
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
Fluid Proc Ready: ID, FluidMaster,LagMaster = 001
 Checking license for Gemini
 Checking license for Linux OS
 Checking internal license list
 License valid

 GEMINI Startup
 Gemini +++ Version 5.1.00  20110501 +++

 + ERROR MESSAGE +
 FILE MISSING (Input): name = gemini.inp
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received waitpid_fired cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received iof_complete cmd
--
mpirun has exited due to process rank 0 with PID 6559 on
node XXX_TUX01 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received exit
[/code]

With "mpirun --debug-daemons -np 2 geminimpi", it hangs like this: (hangs 
indefinitely)

[code]
[xxx@XXX_TUX01 ~]$ mpirun --debug-daemons -np 2 geminimpi
[XXX_TUX01:06570] [[14983,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local 
proc [[14983,1],1]
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local 
proc [[14983,1],0]
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received message_local_procs
[/code]


I cloned my entire installation to a number of other machines to test.  On all 
the other workstations, everything behaves correctly and various regression 
suites return good results. 

Any ideas? 

--
Jon Stergiou
Engineer
NSWC Carderock
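
[Added sketch - not part of the thread.  For readers who land here with a 
similar single-node hang inside MPI_Init: forcing the TCP and self transports 
takes the shared-memory BTL out of the picture, and raising the BTL verbosity 
shows which transports are being selected.  Both flags are standard Open MPI 
MCA options, not commands taken from this exchange.]

[code]
$ mpirun --mca btl tcp,self -np 2 ./hello_c          # bypass the sm (shared-memory) BTL
$ mpirun --mca btl_base_verbose 30 -np 2 ./hello_c   # log BTL selection during MPI_Init
[/code]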