Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Well, it is now launching just fine, so that's one thing! :-)

Afraid I'll have to let the TCP btl guys take over from here. It looks like
everything is up and running, but something strange is going on in the MPI
comm layer.

You can turn off those MCA params I gave you, as you are now past that point.
There are others who can help debug that TCP btl error, and they can take it
from there.

Ralph


On Tue, Aug 11, 2009 at 8:54 AM, Klymak Jody  wrote:

>
> On 11-Aug-09, at 6:16 AM, Jeff Squyres wrote:
>
>  This means that OMPI is finding an mca_iof_proxy.la file at run time from
>> a prior version of Open MPI.  You might want to use "find" or "locate" to
>> search your nodes and find it.  I suspect that you somehow have an OMPI
>> 1.3.x install that overlaid an install of a prior OMPI version installation.
>>
>
>
> OK, right you were - the old file was in my new install directory.  I
> didn't erase /usr/local/openmpi before re-running the install...
>
> However, after reinstalling on the nodes (but not cleaning out /usr/lib on
> all the nodes) I still have the following:
>
> Thanks,  Jody
>
>
> [saturna.cluster:17660] mca:base:select:(  plm) Querying component [rsh]
> [saturna.cluster:17660] mca:base:select:(  plm) Query of component [rsh]
> set priority to 10
> [saturna.cluster:17660] mca:base:select:(  plm) Querying component [slurm]
> [saturna.cluster:17660] mca:base:select:(  plm) Skipping component [slurm].
> Query failed to return a module
> [saturna.cluster:17660] mca:base:select:(  plm) Querying component [tm]
> [saturna.cluster:17660] mca:base:select:(  plm) Skipping component [tm].
> Query failed to return a module
> [saturna.cluster:17660] mca:base:select:(  plm) Querying component [xgrid]
> [saturna.cluster:17660] mca:base:select:(  plm) Skipping component [xgrid].
> Query failed to return a module
> [saturna.cluster:17660] mca:base:select:(  plm) Selected component [rsh]
> [saturna.cluster:17660] plm:base:set_hnp_name: initial bias 17660 nodename
> hash 1656374957
> [saturna.cluster:17660] plm:base:set_hnp_name: final jobfam 24811
> [saturna.cluster:17660] [[24811,0],0] plm:base:receive start comm
> [saturna.cluster:17660] mca:base:select:( odls) Querying component
> [default]
> [saturna.cluster:17660] mca:base:select:( odls) Query of component
> [default] set priority to 1
> [saturna.cluster:17660] mca:base:select:( odls) Selected component
> [default]
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: setting up job [24811,1]
> [saturna.cluster:17660] [[24811,0],0] plm:base:setup_job for job [24811,1]
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: local shell: 0 (bash)
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: assuming same remote shell
> as local shell
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: remote shell: 0 (bash)
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: final template argv:
>/usr/bin/ssh   PATH=/usr/local/openmpi/bin:$PATH ; export
> PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ; export
> LD_LIBRARY_PATH ;  /usr/local/openmpi/bin/orted --debug-daemons -mca ess env
> -mca orte_ess_jobid 1626013696 -mca orte_ess_vpid  -mca
> orte_ess_num_procs 3 --hnp-uri "1626013696.0;tcp://142.104.154.96:49710
> ;tcp://192.168.2.254:49710" -mca plm_base_verbose 5 -mca odls_base_verbose
> 5
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: launching on node xserve01
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: recording launch of daemon
> [[24811,0],1]
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: executing: (//usr/bin/ssh)
> [/usr/bin/ssh xserve01  PATH=/usr/local/openmpi/bin:$PATH ; export PATH ;
> LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ; export
> LD_LIBRARY_PATH ;  /usr/local/openmpi/bin/orted --debug-daemons -mca ess env
> -mca orte_ess_jobid 1626013696 -mca orte_ess_vpid 1 -mca orte_ess_num_procs
> 3 --hnp-uri "1626013696.0;tcp://142.104.154.96:49710;tcp://
> 192.168.2.254:49710" -mca plm_base_verbose 5 -mca odls_base_verbose 5]
> Daemon was launched on xserve01.cluster - beginning to initialize
> [xserve01.cluster:42519] mca:base:select:( odls) Querying component
> [default]
> [xserve01.cluster:42519] mca:base:select:( odls) Query of component
> [default] set priority to 1
> [xserve01.cluster:42519] mca:base:select:( odls) Selected component
> [default]
> Daemon [[24811,0],1] checking in as pid 42519 on host xserve01.cluster
> Daemon [[24811,0],1] not using static ports
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: launching on node xserve02
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: recording launch of daemon
> [[24811,0],2]
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: executing: (//usr/bin/ssh)
> [/usr/bin/ssh xserve02  PATH=/usr/local/openmpi/bin:$PATH ; export PATH ;
> LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ; export
> LD_LIBRARY_PATH ;  /usr/local/openmpi/bin/orted --debug-daemons -mca ess env
> -mca orte_ess_jobid 1626013696 -mca orte_ess_vpid 2 -mca orte_ess_num_procs
> 3 

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Yeah, it's the lib confusion that's at fault - this is the problem:

[saturna.cluster:07360] [[14551,0],0] ORTE_ERROR_LOG: Buffer type (described
> vs non-described) mismatch - operation not allowed in file
> base/odls_base_default_fns.c at line 2475
>

Have you tried configuring with --enable-mpirun-prefix-by-default? That
would help avoid the confusion. You should also check your path (make sure
that mpirun is the one you expect, and that you are getting the
corresponding remote orted).
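
Roughly, something along these lines (a sketch only - the prefix and
hostnames are taken from your earlier output, plus whatever other configure
options you normally use):

  ./configure --prefix=/usr/local/openmpi --enable-mpirun-prefix-by-default
  make all install

  # locally: is this the mpirun you built?
  which mpirun
  otool -L /usr/local/openmpi/bin/mpirun

  # remotely: does a non-interactive shell find the matching orted?
  ssh xserve01 'which orted; /usr/local/openmpi/bin/ompi_info | head -3'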

Ralph

On Tue, Aug 11, 2009 at 8:23 AM, Klymak Jody  wrote:

>
> On 11-Aug-09, at 7:03 AM, Ralph Castain wrote:
>
> Sigh - too early in the morning for this old brain, I fear...
>
> You are right - the ranks are fine, and local rank doesn't matter. It
> sounds like a problem where the TCP messaging is getting a message ack'd
> from someone other than the process that was supposed to receive the
> message. This should cause us to abort, but we were just saying on the
> phone that the abort procedure may not be working correctly. Or it could
> be (as Jeff suggests) that the version mismatch is also preventing us
> from aborting properly.
>
> So I fear we are back to trying to find these other versions on your
> nodes...
>
>
> Well, the old version is still on the nodes (in /usr/lib as default for OS
> X)...
>
> I can try and clean those all out by hand but I'm still confused why the
> old version would be used - how does openMPI find the right library?
>
> Note again, that I get these MCA warnings on the server when just running
> ompi_info and I *have* cleaned out /usr/lib on the server.  So I really
> don't understand how on the server I can still have a library issue.  Is
> there a way to trace at runtime what library an executable is dynamically
> linking to?  Can I rebuild openmpi statically?
>
> Thanks,  Jody
>
>
>
>
> On Tue, Aug 11, 2009 at 7:43 AM, Klymak Jody  wrote:
>
>>
>> On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:
>>
>>  The reason your job is hanging is sitting in the orte-ps output. You have
>>> multiple processes declaring themselves to be the same MPI rank. That
>>> definitely won't work.
>>>
>>
>> It's the "local rank" if that makes any difference...
>>
>> Any thoughts on this output?
>>
>> [xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack]
>> received unexpected process identifier [[61029,1],3]
>>
>>  The question is why is that happening? We use Torque all the time, so we
>>> know that the basic support is correct. It -could- be related to lib
>>> confusion, but I can't tell for sure.
>>>
>>
>> Just to be clear, this is not going through torque at this point.  It's
>> just vanilla ssh, for which this code worked with 1.1.5.
>>
>>
>>  Can you rebuild OMPI with --enable-debug, and rerun the job with the
>>> following added to your cmd line?
>>>
>>> -mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5
>>>
>>
>> Working on this...
>>
>> Thanks,  Jody
>>
>


Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody


On 11-Aug-09, at 6:16 AM, Jeff Squyres wrote:

This means that OMPI is finding an mca_iof_proxy.la file at run time  
from a prior version of Open MPI.  You might want to use "find" or  
"locate" to search your nodes and find it.  I suspect that you  
somehow have an OMPI 1.3.x install that overlaid an install of a  
prior OMPI version installation.



OK, right you were - the old file was in my new install directory.  I  
didn't erase /usr/local/openmpi before re-running the install...


However, after reinstalling on the nodes (but not cleaning out /usr/ 
lib on all the nodes) I still have the following:


Thanks,  Jody


[saturna.cluster:17660] mca:base:select:(  plm) Querying component [rsh]
[saturna.cluster:17660] mca:base:select:(  plm) Query of component  
[rsh] set priority to 10
[saturna.cluster:17660] mca:base:select:(  plm) Querying component  
[slurm]
[saturna.cluster:17660] mca:base:select:(  plm) Skipping component  
[slurm]. Query failed to return a module

[saturna.cluster:17660] mca:base:select:(  plm) Querying component [tm]
[saturna.cluster:17660] mca:base:select:(  plm) Skipping component  
[tm]. Query failed to return a module
[saturna.cluster:17660] mca:base:select:(  plm) Querying component  
[xgrid]
[saturna.cluster:17660] mca:base:select:(  plm) Skipping component  
[xgrid]. Query failed to return a module

[saturna.cluster:17660] mca:base:select:(  plm) Selected component [rsh]
[saturna.cluster:17660] plm:base:set_hnp_name: initial bias 17660  
nodename hash 1656374957

[saturna.cluster:17660] plm:base:set_hnp_name: final jobfam 24811
[saturna.cluster:17660] [[24811,0],0] plm:base:receive start comm
[saturna.cluster:17660] mca:base:select:( odls) Querying component  
[default]
[saturna.cluster:17660] mca:base:select:( odls) Query of component  
[default] set priority to 1
[saturna.cluster:17660] mca:base:select:( odls) Selected component  
[default]

[saturna.cluster:17660] [[24811,0],0] plm:rsh: setting up job [24811,1]
[saturna.cluster:17660] [[24811,0],0] plm:base:setup_job for job  
[24811,1]

[saturna.cluster:17660] [[24811,0],0] plm:rsh: local shell: 0 (bash)
[saturna.cluster:17660] [[24811,0],0] plm:rsh: assuming same remote  
shell as local shell

[saturna.cluster:17660] [[24811,0],0] plm:rsh: remote shell: 0 (bash)
[saturna.cluster:17660] [[24811,0],0] plm:rsh: final template argv:
	/usr/bin/ssh   PATH=/usr/local/openmpi/bin:$PATH ; export  
PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ;  
export LD_LIBRARY_PATH ;  /usr/local/openmpi/bin/orted --debug-daemons  
-mca ess env -mca orte_ess_jobid 1626013696 -mca orte_ess_vpid  
 -mca orte_ess_num_procs 3 --hnp-uri "1626013696.0;tcp:// 
142.104.154.96:49710;tcp://192.168.2.254:49710" -mca plm_base_verbose  
5 -mca odls_base_verbose 5
[saturna.cluster:17660] [[24811,0],0] plm:rsh: launching on node  
xserve01
[saturna.cluster:17660] [[24811,0],0] plm:rsh: recording launch of  
daemon [[24811,0],1]
[saturna.cluster:17660] [[24811,0],0] plm:rsh: executing: (//usr/bin/ 
ssh) [/usr/bin/ssh xserve01  PATH=/usr/local/openmpi/bin:$PATH ;  
export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib: 
$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;  /usr/local/openmpi/bin/ 
orted --debug-daemons -mca ess env -mca orte_ess_jobid 1626013696 -mca  
orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri  
"1626013696.0;tcp://142.104.154.96:49710;tcp://192.168.2.254:49710" - 
mca plm_base_verbose 5 -mca odls_base_verbose 5]

Daemon was launched on xserve01.cluster - beginning to initialize
[xserve01.cluster:42519] mca:base:select:( odls) Querying component  
[default]
[xserve01.cluster:42519] mca:base:select:( odls) Query of component  
[default] set priority to 1
[xserve01.cluster:42519] mca:base:select:( odls) Selected component  
[default]

Daemon [[24811,0],1] checking in as pid 42519 on host xserve01.cluster
Daemon [[24811,0],1] not using static ports
[saturna.cluster:17660] [[24811,0],0] plm:rsh: launching on node  
xserve02
[saturna.cluster:17660] [[24811,0],0] plm:rsh: recording launch of  
daemon [[24811,0],2]
[saturna.cluster:17660] [[24811,0],0] plm:rsh: executing: (//usr/bin/ 
ssh) [/usr/bin/ssh xserve02  PATH=/usr/local/openmpi/bin:$PATH ;  
export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib: 
$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;  /usr/local/openmpi/bin/ 
orted --debug-daemons -mca ess env -mca orte_ess_jobid 1626013696 -mca  
orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri  
"1626013696.0;tcp://142.104.154.96:49710;tcp://192.168.2.254:49710" - 
mca plm_base_verbose 5 -mca odls_base_verbose 5]

Daemon was launched on xserve02.local - beginning to initialize
[xserve02.local:42180] mca:base:select:( odls) Querying component  
[default]
[xserve02.local:42180] mca:base:select:( odls) Query of component  
[default] set priority to 1
[xserve02.local:42180] mca:base:select:( odls) Selected component  
[default]

Daemon [[24811,0],2] checking in as pid 42180 on host xserve02.local
Daemon [[24811,0],2] not using 

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody


On 11-Aug-09, at 7:03 AM, Ralph Castain wrote:


Sigh - too early in the morning for this old brain, I fear...

You are right - the ranks are fine, and local rank doesn't matter.
It sounds like a problem where the TCP messaging is getting a
message ack'd from someone other than the process that was supposed
to receive the message. This should cause us to abort, but we were
just saying on the phone that the abort procedure may not be working
correctly. Or it could be (as Jeff suggests) that the version
mismatch is also preventing us from aborting properly.


So I fear we are back to trying to find these other versions on your  
nodes...


Well, the old version is still on the nodes (in /usr/lib as default  
for OS X)...


I can try and clean those all out by hand but I'm still confused why  
the old version would be used - how does openMPI find the right library?


Note again, that I get these MCA warnings on the server when just  
running ompi_info and I *have* cleaned out /usr/lib on the server.  So  
I really don't understand how on the server I can still have a library  
issue.  Is there a way to trace at runtime what library an executable  
is dynamically linking to?  Can I rebuild openmpi statically?


Thanks,  Jody





On Tue, Aug 11, 2009 at 7:43 AM, Klymak Jody  wrote:

On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:

The reason your job is hanging is sitting in the orte-ps output. You  
have multiple processes declaring themselves to be the same MPI  
rank. That definitely won't work.


It's the "local rank" if that makes any difference...

Any thoughts on this output?


[xserve03.local][[61029,1],4][btl_tcp_endpoint.c: 
486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected  
process identifier [[61029,1],3]


The question is why is that happening? We use Torque all the time,  
so we know that the basic support is correct. It -could- be related  
to lib confusion, but I can't tell for sure.


Just to be clear, this is not going through torque at this point.   
It's just vanilla ssh, for which this code worked with 1.1.5.




Can you rebuild OMPI with --enable-debug, and rerun the job with the  
following added to your cmd line?


-mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5

Working on this...

Thanks,  Jody





Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody


On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:


-mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5

I'm afraid the output will be a tad verbose, but I would appreciate  
seeing it. Might also tell us something about the lib issue.



Command line was:

/usr/local/openmpi/bin/mpirun -mca plm_base_verbose 5 --debug-daemons - 
mca odls_base_verbose 5 -n 16 --host xserve03,xserve04 ../build/mitgcmuv



Starting: ../results//TasGaussRestart16
[saturna.cluster:07360] mca:base:select:(  plm) Querying component [rsh]
[saturna.cluster:07360] mca:base:select:(  plm) Query of component  
[rsh] set priority to 10
[saturna.cluster:07360] mca:base:select:(  plm) Querying component  
[slurm]
[saturna.cluster:07360] mca:base:select:(  plm) Skipping component  
[slurm]. Query failed to return a module

[saturna.cluster:07360] mca:base:select:(  plm) Querying component [tm]
[saturna.cluster:07360] mca:base:select:(  plm) Skipping component  
[tm]. Query failed to return a module
[saturna.cluster:07360] mca:base:select:(  plm) Querying component  
[xgrid]
[saturna.cluster:07360] mca:base:select:(  plm) Skipping component  
[xgrid]. Query failed to return a module

[saturna.cluster:07360] mca:base:select:(  plm) Selected component [rsh]
[saturna.cluster:07360] plm:base:set_hnp_name: initial bias 7360  
nodename hash 1656374957

[saturna.cluster:07360] plm:base:set_hnp_name: final jobfam 14551
[saturna.cluster:07360] [[14551,0],0] plm:base:receive start comm
[saturna.cluster:07360] mca: base: component_find: ras  
"mca_ras_dash_host" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca: base: component_find: ras  
"mca_ras_hostfile" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca: base: component_find: ras  
"mca_ras_localhost" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca: base: component_find: ras "mca_ras_xgrid"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca:base:select:( odls) Querying component  
[default]
[saturna.cluster:07360] mca:base:select:( odls) Query of component  
[default] set priority to 1
[saturna.cluster:07360] mca:base:select:( odls) Selected component  
[default]
[saturna.cluster:07360] mca: base: component_find: iof "mca_iof_proxy"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca: base: component_find: iof "mca_iof_svc"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored

[saturna.cluster:07360] [[14551,0],0] plm:rsh: setting up job [14551,1]
[saturna.cluster:07360] [[14551,0],0] plm:base:setup_job for job  
[14551,1]

[saturna.cluster:07360] [[14551,0],0] plm:rsh: local shell: 0 (bash)
[saturna.cluster:07360] [[14551,0],0] plm:rsh: assuming same remote  
shell as local shell

[saturna.cluster:07360] [[14551,0],0] plm:rsh: remote shell: 0 (bash)
[saturna.cluster:07360] [[14551,0],0] plm:rsh: final template argv:
	/usr/bin/ssh   PATH=/usr/local/openmpi/bin:$PATH ; export  
PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ;  
export LD_LIBRARY_PATH ;  /usr/local/openmpi/bin/orted --debug-daemons  
-mca ess env -mca orte_ess_jobid 953614336 -mca orte_ess_vpid  
 -mca orte_ess_num_procs 3 --hnp-uri "953614336.0;tcp:// 
142.104.154.96:49622;tcp://192.168.2.254:49622" -mca plm_base_verbose  
5 -mca odls_base_verbose 5
[saturna.cluster:07360] [[14551,0],0] plm:rsh: launching on node  
xserve03
[saturna.cluster:07360] [[14551,0],0] plm:rsh: recording launch of  
daemon [[14551,0],1]
[saturna.cluster:07360] [[14551,0],0] plm:rsh: executing: (//usr/bin/ 
ssh) [/usr/bin/ssh xserve03  PATH=/usr/local/openmpi/bin:$PATH ;  
export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib: 
$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;  /usr/local/openmpi/bin/ 
orted --debug-daemons -mca ess env -mca orte_ess_jobid 953614336 -mca  
orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "953614336.0;tcp:// 
142.104.154.96:49622;tcp://192.168.2.254:49622" -mca plm_base_verbose  
5 -mca odls_base_verbose 5]

Daemon was launched on xserve03.local - beginning to initialize
[xserve03.local:40708] mca:base:select:( odls) Querying component  
[default]
[xserve03.local:40708] mca:base:select:( odls) Query of component  
[default] set priority to 1
[xserve03.local:40708] mca:base:select:( odls) Selected component  
[default]
[xserve03.local:40708] mca: base: component_find: iof "mca_iof_proxy"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored
[xserve03.local:40708] mca: base: component_find: iof "mca_iof_svc"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Sigh - too early in the morning for this old brain, I fear...

You are right - the ranks are fine, and local rank doesn't matter. It sounds
like a problem where the TCP messaging is getting a message ack'd from
someone other than the process that was supposed to receive the message. This
should cause us to abort, but we were just saying on the phone that the
abort procedure may not be working correctly. Or it could be (as Jeff
suggests) that the version mismatch is also preventing us from aborting
properly.

So I fear we are back to trying to find these other versions on your
nodes...


On Tue, Aug 11, 2009 at 7:43 AM, Klymak Jody  wrote:

>
> On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:
>
>  The reason your job is hanging is sitting in the orte-ps output. You have
>> multiple processes declaring themselves to be the same MPI rank. That
>> definitely won't work.
>>
>
> It's the "local rank" if that makes any difference...
>
> Any thoughts on this output?
>
> [xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack]
> received unexpected process identifier [[61029,1],3]
>
>  The question is why is that happening? We use Torque all the time, so we
>> know that the basic support is correct. It -could- be related to lib
>> confusion, but I can't tell for sure.
>>
>
> Just to be clear, this is not going through torque at this point.  It's just
> vanilla ssh, for which this code worked with 1.1.5.
>
>
>  Can you rebuild OMPI with --enable-debug, and rerun the job with the
>> following added to your cmd line?
>>
>> -mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5
>>
>
> Working on this...
>
> Thanks,  Jody
>
>


Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Jeff Squyres

On Aug 11, 2009, at 9:43 AM, Klymak Jody wrote:


[xserve03.local][[61029,1],4][btl_tcp_endpoint.c:
486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process
identifier [[61029,1],3]



This could well be caused by a version mismatch between your nodes.
E.g., if one node is running OMPI vx.y.z and another is running
va.b.c.  We don't check for version mismatch in network
communications, and our wire protocols have changed between versions.
So if vx.y.z sends something that is not understood by va.b.c,
something like the above message could occur.
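
A quick way to check what each node actually resolves (a rough sketch;
the hostnames are the ones mentioned earlier in this thread, and the
version is printed in the first few lines of ompi_info's output):

  for h in xserve01 xserve02 xserve03 xserve04; do
    echo "== $h =="
    ssh $h '/usr/local/openmpi/bin/ompi_info | head -3'
  done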


--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody


On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:

The reason your job is hanging is sitting in the orte-ps output. You  
have multiple processes declaring themselves to be the same MPI  
rank. That definitely won't work.


It's the "local rank" if that makes any difference...

Any thoughts on this output?

[xserve03.local][[61029,1],4][btl_tcp_endpoint.c: 
486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process  
identifier [[61029,1],3]


The question is why is that happening? We use Torque all the time,  
so we know that the basic support is correct. It -could- be related  
to lib confusion, but I can't tell for sure.


Just to be clear, this is not going through torque at this point.  It's
just vanilla ssh, for which this code worked with 1.1.5.



Can you rebuild OMPI with --enable-debug, and rerun the job with the  
following added to your cmd line?


-mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5


Working on this...

Thanks,  Jody


Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Oops - I should have looked at your output more closely. The component_find
warnings are clearly indicating some old libs lying around, but that isn't
why your job is hanging.

The reason your job is hanging is sitting in the orte-ps output. You have
multiple processes declaring themselves to be the same MPI rank. That
definitely won't work.

The question is why is that happening? We use Torque all the time, so we
know that the basic support is correct. It -could- be related to lib
confusion, but I can't tell for sure.

Can you rebuild OMPI with --enable-debug, and rerun the job with the
following added to your cmd line?

-mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5

I'm afraid the output will be a tad verbose, but I would appreciate seeing
it. Might also tell us something about the lib issue.
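
Concretely, something like this (a sketch - same prefix and hosts as in
your earlier mails, plus whatever other configure options you normally use):

  ./configure --prefix=/usr/local/openmpi --enable-debug
  make all install

  /usr/local/openmpi/bin/mpirun -mca plm_base_verbose 5 --debug-daemons \
      -mca odls_base_verbose 5 -n 16 --host xserve03,xserve04 ../build/mitgcmuv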

Thanks
Ralph


On Tue, Aug 11, 2009 at 7:22 AM, Ralph Castain  wrote:

> Sorry, but Jeff is correct - that error message clearly indicates a version
> mismatch. Somewhere, one or more of your nodes is still picking up an old
> version.
>
>
>
>
> On Tue, Aug 11, 2009 at 7:16 AM, Jeff Squyres  wrote:
>
>> On Aug 11, 2009, at 9:11 AM, Klymak Jody wrote:
>>
>>  I have removed all the OS-X -supplied libraries, recompiled and
>>> installed openmpi 1.3.3, and I am *still* getting this warning when
>>> running ompi_info:
>>>
>>> [saturna.cluster:50307] mca: base: component_find: iof "mca_iof_proxy"
>>> uses an MCA interface that is not recognized (component MCA v1.0.0 !=
>>> supported MCA v2.0.0) -- ignored
>>>
>>>
>> This means that OMPI is finding an mca_iof_proxy.la file at run time from
>> a prior version of Open MPI.  You might want to use "find" or "locate" to
>> search your nodes and find it.  I suspect that you somehow have an OMPI
>> 1.3.x install that overlaid an install of a prior OMPI version installation.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>>
>>
>>
>
>


Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Sorry, but Jeff is correct - that error message clearly indicates a version
mismatch. Somewhere, one or more of your nodes is still picking up an old
version.



On Tue, Aug 11, 2009 at 7:16 AM, Jeff Squyres  wrote:

> On Aug 11, 2009, at 9:11 AM, Klymak Jody wrote:
>
>  I have removed all the OS-X -supplied libraries, recompiled and
>> installed openmpi 1.3.3, and I am *still* getting this warning when
>> running ompi_info:
>>
>> [saturna.cluster:50307] mca: base: component_find: iof "mca_iof_proxy"
>> uses an MCA interface that is not recognized (component MCA v1.0.0 !=
>> supported MCA v2.0.0) -- ignored
>>
>>
> This means that OMPI is finding an mca_iof_proxy.la file at run time from
> a prior version of Open MPI.  You might want to use "find" or "locate" to
> search your nodes and find it.  I suspect that you somehow have an OMPI
> 1.3.x install that overlaid an install of a prior OMPI version installation.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
>
>


Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Jeff Squyres

On Aug 11, 2009, at 9:11 AM, Klymak Jody wrote:


I have removed all the OS-X -supplied libraries, recompiled and
installed openmpi 1.3.3, and I am *still* getting this warning when
running ompi_info:

[saturna.cluster:50307] mca: base: component_find: iof "mca_iof_proxy"
uses an MCA interface that is not recognized (component MCA v1.0.0 !=
supported MCA v2.0.0) -- ignored



This means that OMPI is finding an mca_iof_proxy.la file at run time  
from a prior version of Open MPI.  You might want to use "find" or  
"locate" to search your nodes and find it.  I suspect that you somehow  
have an OMPI 1.3.x install that overlaid an install of a prior OMPI  
version installation.
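
If it helps, a rough sketch of that search (typical locations on OS X;
adjust the paths to your setup):

  # run on every node; any hits outside the new 1.3.3 tree are leftovers
  find /usr/lib /usr/local -name 'mca_iof_*' 2>/dev/null
  locate mca_iof_proxy.la

  # the 1.3.3 components should all live under the new prefix
  ls /usr/local/openmpi/lib/openmpi | grep iof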


--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody


On 10-Aug-09, at 8:03 PM, Ralph Castain wrote:

Interesting! Well, I always make sure I have my personal OMPI build  
before any system stuff, and I work exclusively on Mac OS-X:


I am still finding this very mysterious

I have removed all the OS-X -supplied libraries, recompiled and  
installed openmpi 1.3.3, and I am *still* getting this warning when  
running ompi_info:


[saturna.cluster:50307] mca: base: component_find: iof "mca_iof_proxy"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: iof "mca_iof_svc"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: ras  
"mca_ras_dash_host" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: ras  
"mca_ras_hostfile" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: ras  
"mca_ras_localhost" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: ras "mca_ras_xgrid"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: rcache  
"mca_rcache_rb" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored


So, I guess I'm not clear how the library can be an issue...

I *do* get another error from running the gcm that I do not get from  
running simpler jobs - hopefully this helps explain things:


[xserve03.local][[61029,1],4][btl_tcp_endpoint.c: 
486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process  
identifier [[61029,1],3]


The mitgcmuv processes are running on the xserves, and using
considerable resources!  They open STDERR/STDOUT but nothing is
flushed into them, including the few print statements I've put in
before and after MPI_INIT (as Ralph suggested).


On 11-Aug-09, at 4:17 AM, Ashley Pittman wrote:

If you suspect a hang then you can use the command orte-ps (on the  
node
where the mpirun is running) and it should show you your job.  This  
will

tell you if the job is started and still running or if there was a
problem launching.


/usr/local/openmpi/bin/orte-ps
[saturna.cluster:51840] mca: base: component_find: iof "mca_iof_proxy"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored
[saturna.cluster:51840] mca: base: component_find: iof "mca_iof_svc"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored



Information from mpirun [61029,0]
---

JobID |   State |  Slots | Num Procs |
--
[61029,1] | Running |  2 |16 |
     Process Name |  ORTE Name | Local Rank |    PID | Node     |   State |
---
    ../build/mitgcmuv |  [[61029,1],0] |  0 |  40206 | xserve03 | Running |
    ../build/mitgcmuv |  [[61029,1],1] |  0 |  40005 | xserve04 | Running |
    ../build/mitgcmuv |  [[61029,1],2] |  1 |  40207 | xserve03 | Running |
    ../build/mitgcmuv |  [[61029,1],3] |  1 |  40006 | xserve04 | Running |
    ../build/mitgcmuv |  [[61029,1],4] |  2 |  40208 | xserve03 | Running |
    ../build/mitgcmuv |  [[61029,1],5] |  2 |  40007 | xserve04 | Running |
    ../build/mitgcmuv |  [[61029,1],6] |  3 |  40209 | xserve03 | Running |
    ../build/mitgcmuv |  [[61029,1],7] |  3 |  40008 | xserve04 | Running |
    ../build/mitgcmuv |  [[61029,1],8] |  4 |  40210 | xserve03 | Running |
    ../build/mitgcmuv |  [[61029,1],9] |  4 |  40009 | xserve04 | Running |
    ../build/mitgcmuv | [[61029,1],10] |  5 |  40211 | xserve03 | Running |
    ../build/mitgcmuv | [[61029,1],11] |  5 |  40010 | xserve04 | Running |
    ../build/mitgcmuv | [[61029,1],12] |  6 |  40212 | xserve03 | Running |
    ../build/mitgcmuv | [[61029,1],13] |  6 |  40011 | xserve04 | Running |
    ../build/mitgcmuv | [[61029,1],14] |  7 |  40213 | xserve03 | Running |
    ../build/mitgcmuv | [[61029,1],15] |  7 |  40012 | xserve04 | Running |


Thanks,  Jody





Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain


On Aug 11, 2009, at 5:17 AM, Ashley Pittman wrote:


On Tue, 2009-08-11 at 03:03 -0600, Ralph Castain wrote:

If it isn't already there, try putting a print statement right at
program start, another just prior to MPI_Init, and another just after
MPI_Init. It could be that something is hanging somewhere during
program startup since it sounds like everything is launching just
fine.


If you suspect a hang then you can use the command orte-ps (on the  
node
where the mpirun is running) and it should show you your job.  This  
will

tell you if the job is started and still running or if there was a
problem launching.

If the program did start and has really hung then you can get more
in-depth information about it using padb which is linked to in my
signature.


FWIW: we use padb for this purpose, and it is very helpful!

Ralph



Ashley,

--

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk





Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ashley Pittman
On Tue, 2009-08-11 at 03:03 -0600, Ralph Castain wrote:
> If it isn't already there, try putting a print statement right at
> program start, another just prior to MPI_Init, and another just after
> MPI_Init. It could be that something is hanging somewhere during
> program startup since it sounds like everything is launching just
> fine.

If you suspect a hang then you can use the command orte-ps (on the node
where the mpirun is running) and it should show you your job.  This will
tell you if the job is started and still running or if there was a
problem launching.

If the program did start and has really hung then you can get more
in-depth information about it using padb which is linked to in my
signature.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody


On 10-Aug-09, at 8:03 PM, Ralph Castain wrote:

Interesting! Well, I always make sure I have my personal OMPI build  
before any system stuff, and I work exclusively on Mac OS-X:


Note that I always configure with --prefix=somewhere-in-my-own-dir,  
never to a system directory. Avoids this kind of confusion.


Yeah, I did configure --prefix=/usr/local/openmpi

What the errors are saying is that we are picking up components from  
a very old version of OMPI that is distributed by Apple. It may or  
may not be causing confusion for the system - hard to tell. However,  
the fact that it is the IO forwarding subsystem that is picking them  
up, and the fact that you aren't seeing any output from your job,  
makes me a tad suspicious.


Me too!

Can you run other jobs? In other words, do you get stdout/stderr  
from other programs you run, or does every MPI program hang (even  
simple ones)? If it is just your program, then it could just be that  
your application is hanging before any output is generated. Can you  
have it print something to stderr right when it starts?


No, simple ones, like the examples I gave before, run fine, just with  
the suspicious warnings.


I'm running a big general circulation model (MITgcm).  Under normal  
conditions it spits something out almost right away, and that is not  
being done here.  STDOUT.0001 etc are all opened, but nothing is put  
into them.


I'm pretty sure I'm compliling the gcm properly:

otool -L mitgcmuv
mitgcmuv:
	/usr/local/openmpi/lib/libmpi_f77.0.dylib (compatibility version  
1.0.0, current version 1.0.0)
	/usr/local/openmpi/lib/libmpi.0.dylib (compatibility version 1.0.0,  
current version 1.0.0)
	/usr/local/openmpi/lib/libopen-rte.0.dylib (compatibility version  
1.0.0, current version 1.0.0)
	/usr/local/openmpi/lib/libopen-pal.0.dylib (compatibility version  
1.0.0, current version 1.0.0)
	/usr/lib/libutil.dylib (compatibility version 1.0.0, current version  
1.0.0)
	/usr/local/lib/libgfortran.3.dylib (compatibility version 4.0.0,  
current version 4.0.0)
	/usr/local/lib/libgcc_s.1.dylib (compatibility version 1.0.0, current  
version 1.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current  
version 111.1.3)


Thanks,  Jody




On Aug 10, 2009, at 8:53 PM, Klymak Jody wrote:



On 10-Aug-09, at 6:44 PM, Ralph Castain wrote:

Check your LD_LIBRARY_PATH - there is an earlier version of OMPI  
in your path that is interfering with operation (i.e., it comes  
before your 1.3.3 installation).


Hmmm, the OS X faq says not to do this:

"Note that there is no need to add Open MPI's libdir to  
LD_LIBRARY_PATH; Open MPI's shared library build process  
automatically uses the "rpath" mechanism to automatically find the  
correct shared libraries (i.e., the ones associated with this  
build, vs., for example, the OS X-shipped OMPI shared libraries).  
Also note that we specifically do not recommend adding Open MPI's  
libdir to DYLD_LIBRARY_PATH."


http://www.open-mpi.org/faq/?category=osx

Regardless, if I set either, and run ompi_info I still get:

[saturna.cluster:94981] mca: base: component_find: iof  
"mca_iof_proxy" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:94981] mca: base: component_find: iof  
"mca_iof_svc" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored


echo $DYLD_LIBRARY_PATH $LD_LIBRARY_PATH
/usr/local/openmpi/lib: /usr/local/openmpi/lib:

So I'm afraid I'm stumped again.  I suppose I could go clean out  
all the libraries in /usr/lib/...


Thanks again, sorry to be a pain...

Cheers,  Jody






On Aug 10, 2009, at 7:38 PM, Klymak Jody wrote:


So,

mpirun --display-allocation -pernode --display-map hostname

gives me the output below.  Simple jobs seem to run, but the  
MITgcm does not, either under ssh or torque.  It hangs at some  
early point in execution before anything is written, so its hard  
for me to tell what the error is.  Could these MCA warnings have  
anything to do with it?


I've recompiled the gcm with -L /usr/local/openmpi/lib, so  
hopefully that catches the right library.


Thanks,  Jody


[xserve02.local:38126] mca: base: component_find: ras  
"mca_ras_dash_host" uses an MCA interface that is not recogniz

ed (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras  
"mca_ras_hostfile" uses an MCA interface that is not recognize

d (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras  
"mca_ras_localhost" uses an MCA interface that is not recogniz

ed (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras  
"mca_ras_xgrid" uses an MCA interface that is not recognized (

component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: iof  
"mca_iof_proxy" 

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Interesting! Well, I always make sure I have my personal OMPI build  
before any system stuff, and I work exclusively on Mac OS-X:


rhc$ echo $PATH
/Library/Frameworks/Python.framework/Versions/Current/bin:/Users/rhc/ 
openmpi/bin:/Users/rhc/bin:/opt/local/bin:/usr/X11R6/bin:/usr/local/ 
bin:/opt/local/bin:/opt/local/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/ 
local/bin:/usr/texbin


rhc$ echo $LD_LIBRARY_PATH
/Users/rhc/openmpi/lib:/Users/rhc/lib:/opt/local/lib:/usr/X11R6/lib:/ 
usr/local/lib:


Note that I always configure with --prefix=somewhere-in-my-own-dir,  
never to a system directory. Avoids this kind of confusion.


What the errors are saying is that we are picking up components from a  
very old version of OMPI that is distributed by Apple. It may or may  
not be causing confusion for the system - hard to tell. However, the  
fact that it is the IO forwarding subsystem that is picking them up,  
and the fact that you aren't seeing any output from your job, makes me  
a tad suspicious.


Can you run other jobs? In other words, do you get stdout/stderr from  
other programs you run, or does every MPI program hang (even simple  
ones)? If it is just your program, then it could just be that your  
application is hanging before any output is generated. Can you have it  
print something to stderr right when it starts?



On Aug 10, 2009, at 8:53 PM, Klymak Jody wrote:



On 10-Aug-09, at 6:44 PM, Ralph Castain wrote:

Check your LD_LIBRARY_PATH - there is an earlier version of OMPI in  
your path that is interfering with operation (i.e., it comes before  
your 1.3.3 installation).


Hmmm, the OS X faq says not to do this:

"Note that there is no need to add Open MPI's libdir to  
LD_LIBRARY_PATH; Open MPI's shared library build process  
automatically uses the "rpath" mechanism to automatically find the  
correct shared libraries (i.e., the ones associated with this build,  
vs., for example, the OS X-shipped OMPI shared libraries). Also note  
that we specifically do not recommend adding Open MPI's libdir to  
DYLD_LIBRARY_PATH."


http://www.open-mpi.org/faq/?category=osx

Regardless, if I set either, and run ompi_info I still get:

[saturna.cluster:94981] mca: base: component_find: iof  
"mca_iof_proxy" uses an MCA interface that is not recognized  
(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:94981] mca: base: component_find: iof "mca_iof_svc"  
uses an MCA interface that is not recognized (component MCA v1.0.0 ! 
= supported MCA v2.0.0) -- ignored


echo $DYLD_LIBRARY_PATH $LD_LIBRARY_PATH
/usr/local/openmpi/lib: /usr/local/openmpi/lib:

So I'm afraid I'm stumped again.  I suppose I could go clean out all  
the libraries in /usr/lib/...


Thanks again, sorry to be a pain...

Cheers,  Jody






On Aug 10, 2009, at 7:38 PM, Klymak Jody wrote:


So,

mpirun --display-allocation -pernode --display-map hostname

gives me the output below.  Simple jobs seem to run, but the  
MITgcm does not, either under ssh or torque.  It hangs at some  
early point in execution before anything is written, so it's hard  
for me to tell what the error is.  Could these MCA warnings have  
anything to do with it?


I've recompiled the gcm with -L /usr/local/openmpi/lib, so  
hopefully that catches the right library.


Thanks,  Jody


[xserve02.local:38126] mca: base: component_find: ras  
"mca_ras_dash_host" uses an MCA interface that is not recogniz

ed (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras  
"mca_ras_hostfile" uses an MCA interface that is not recognize

d (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras  
"mca_ras_localhost" uses an MCA interface that is not recogniz

ed (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras  
"mca_ras_xgrid" uses an MCA interface that is not recognized (

component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: iof  
"mca_iof_proxy" uses an MCA interface that is not recognized (

component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: iof  
"mca_iof_svc" uses an MCA interface that is not recognized (co

mponent MCA v1.0.0 != supported MCA v2.0.0) -- ignored

==   ALLOCATED NODES   ==

Data for node: Name: xserve02.localNum slots: 8Max slots: 0
Data for node: Name: xserve01.localNum slots: 8Max slots: 0

=

   JOB MAP   

Data for node: Name: xserve02.localNum procs: 1
  Process OMPI jobid: [20967,1] Process rank: 0

Data for node: Name: xserve01.localNum procs: 1
  Process OMPI jobid: [20967,1] Process rank: 1

=

Re: [OMPI users] torque pbs behaviour...

2009-08-10 Thread Klymak Jody


On 10-Aug-09, at 6:44 PM, Ralph Castain wrote:

Check your LD_LIBRARY_PATH - there is an earlier version of OMPI in  
your path that is interfering with operation (i.e., it comes before  
your 1.3.3 installation).


Hmmm, the OS X faq says not to do this:

"Note that there is no need to add Open MPI's libdir to  
LD_LIBRARY_PATH; Open MPI's shared library build process automatically  
uses the "rpath" mechanism to automatically find the correct shared  
libraries (i.e., the ones associated with this build, vs., for  
example, the OS X-shipped OMPI shared libraries). Also note that we  
specifically do not recommend adding Open MPI's libdir to  
DYLD_LIBRARY_PATH."


http://www.open-mpi.org/faq/?category=osx

Regardless, if I set either, and run ompi_info I still get:

[saturna.cluster:94981] mca: base: component_find: iof "mca_iof_proxy"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored
[saturna.cluster:94981] mca: base: component_find: iof "mca_iof_svc"  
uses an MCA interface that is not recognized (component MCA v1.0.0 !=  
supported MCA v2.0.0) -- ignored


echo $DYLD_LIBRARY_PATH $LD_LIBRARY_PATH
/usr/local/openmpi/lib: /usr/local/openmpi/lib:

So I'm afraid I'm stumped again.  I suppose I could go clean out all  
the libraries in /usr/lib/...
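
Before doing that, one way to check what is actually being loaded at run
time (a rough sketch; otool shows the link-time dependencies, and
DYLD_PRINT_LIBRARIES=1 makes dyld print each library as it loads):

  otool -L ./mitgcmuv
  DYLD_PRINT_LIBRARIES=1 ./mitgcmuv 2>&1 | grep -i -e mpi -e orte -e opal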


Thanks again, sorry to be a pain...

Cheers,  Jody






On Aug 10, 2009, at 7:38 PM, Klymak Jody wrote:


So,

mpirun --display-allocation -pernode --display-map hostname

gives me the output below.  Simple jobs seem to run, but the MITgcm  
does not, either under ssh or torque.  It hangs at some early point  
in execution before anything is written, so it's hard for me to tell  
what the error is.  Could these MCA warnings have anything to do  
with it?


I've recompiled the gcm with -L /usr/local/openmpi/lib, so  
hopefully that catches the right library.


Thanks,  Jody


[xserve02.local:38126] mca: base: component_find: ras  
"mca_ras_dash_host" uses an MCA interface that is not recogniz

ed (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras  
"mca_ras_hostfile" uses an MCA interface that is not recognize

d (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras  
"mca_ras_localhost" uses an MCA interface that is not recogniz

ed (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras  
"mca_ras_xgrid" uses an MCA interface that is not recognized (

component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: iof  
"mca_iof_proxy" uses an MCA interface that is not recognized (

component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: iof "mca_iof_svc"  
uses an MCA interface that is not recognized (co

mponent MCA v1.0.0 != supported MCA v2.0.0) -- ignored

==   ALLOCATED NODES   ==

Data for node: Name: xserve02.localNum slots: 8Max slots: 0
Data for node: Name: xserve01.localNum slots: 8Max slots: 0

=

   JOB MAP   

Data for node: Name: xserve02.localNum procs: 1
  Process OMPI jobid: [20967,1] Process rank: 0

Data for node: Name: xserve01.localNum procs: 1
  Process OMPI jobid: [20967,1] Process rank: 1

=
[xserve01.cluster:38518] mca: base: component_find: iof  
"mca_iof_proxy" uses an MCA interface that is not recognized

(component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve01.cluster:38518] mca: base: component_find: iof  
"mca_iof_svc" uses an MCA interface that is not recognized (

component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
xserve02.local
xserve01.cluster






Re: [OMPI users] torque pbs behaviour...

2009-08-10 Thread Klymak Jody

So,

mpirun --display-allocation -pernode --display-map hostname

gives me the output below.  Simple jobs seem to run, but the MITgcm  
does not, either under ssh or torque.  It hangs at some early point in  
execution before anything is written, so it's hard for me to tell what  
the error is.  Could these MCA warnings have anything to do with it?


I've recompiled the gcm with -L /usr/local/openmpi/lib, so hopefully  
that catches the right library.


Thanks,  Jody


[xserve02.local:38126] mca: base: component_find: ras  
"mca_ras_dash_host" uses an MCA interface that is not recogniz

ed (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras  
"mca_ras_hostfile" uses an MCA interface that is not recognize

d (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras  
"mca_ras_localhost" uses an MCA interface that is not recogniz

ed (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_xgrid"  
uses an MCA interface that is not recognized (

component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: iof "mca_iof_proxy"  
uses an MCA interface that is not recognized (

component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: iof "mca_iof_svc"  
uses an MCA interface that is not recognized (co

mponent MCA v1.0.0 != supported MCA v2.0.0) -- ignored

==   ALLOCATED NODES   ==

 Data for node: Name: xserve02.localNum slots: 8Max slots: 0
 Data for node: Name: xserve01.localNum slots: 8Max slots: 0

=

    JOB MAP   

 Data for node: Name: xserve02.localNum procs: 1
Process OMPI jobid: [20967,1] Process rank: 0

 Data for node: Name: xserve01.localNum procs: 1
Process OMPI jobid: [20967,1] Process rank: 1

 =
[xserve01.cluster:38518] mca: base: component_find: iof  
"mca_iof_proxy" uses an MCA interface that is not recognized

 (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve01.cluster:38518] mca: base: component_find: iof "mca_iof_svc"  
uses an MCA interface that is not recognized (

component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
xserve02.local
xserve01.cluster




Re: [OMPI users] torque pbs behaviour...

2009-08-10 Thread Ralph Castain
No problem - actually, that default works with any environment, not  
just Torque



On Aug 10, 2009, at 4:37 PM, Gus Correa wrote:


Thank you for the correction, Ralph.
I didn't know there was a (wise) default for the
number of processes when using Torque-enabled OpenMPI.

Gus Correa

Ralph Castain wrote:

Just to correct something said here.

You need to tell mpirun how many processes to launch,
regardless of whether you are using Torque or not.
This is not correct. If you don't tell mpirun how many processes to  
launch, we will automatically launch one process for every slot in  
your allocation. In the case described here, there were 16 slots  
allocated, so we would automatically launch 16 processes.

Ralph
On Aug 10, 2009, at 3:47 PM, Gus Correa wrote:

Hi Jody, list

See comments inline.

Jody Klymak wrote:

On Aug 10, 2009, at  13:01 PM, Gus Correa wrote:

Hi Jody

We don't have Mac OS-X, but Linux, not sure if this applies to  
you.


Did you configure your OpenMPI with Torque support,
and pointed to the same library that provides the
Torque you are using (--with-tm=/path/to/torque-library- 
directory)?

Not explicitly. I'll check into that



1) If you don't do it explicitly, configure will use the first  
libtorque

it finds (and that works I presume),
which may/may not be the one you want, if you have more than one.
If you only have one version of Torque installed,
this shouldn't be the problem.

2) Have you tried something very simple, like the examples/hello_c.c
program, to test the Torque-OpenMPI integration?

3) Also, just in case, put a "cat $PBS_NODEFILE" inside your script,
before mpirun, to see what it reports (a combined example script
follows item 4 below).
For  "#PBS -l nodes=2:ppn=8"
it should show 16 lines, 8 with each node's name.

4) Finally, just to make sure the syntax is right.
On your message you wrote:

>>> If I submit openMPI with:
>>> #PBS -l nodes=2:ppn=8
>>> mpirun MyProg

Is this the real syntax you used?

Or was it perhaps:

#PBS -l nodes=2:ppn=8
mpirun -n 16 MyProg

You need to tell mpirun how many processes to launch,
regardless of whether you are using Torque or not.
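
Putting 3) and 4) together, a minimal submission script would look
something like this (the mpirun path and program name are the ones used
elsewhere in this thread; with a Torque-aware Open MPI build the "-n 16"
can be omitted, per Ralph's correction above):

  #PBS -l nodes=2:ppn=8
  cd $PBS_O_WORKDIR
  cat $PBS_NODEFILE
  /usr/local/openmpi/bin/mpirun -n 16 ../build/mitgcmuv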

My $0.02

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-



Are you using the right mpirun? (There are so many out there.)

yeah - I use the explicit path and moved the OS X one.
Thanks!  Jody

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Jody Klymak wrote:

Hi All,
I've been trying to get torque pbs to work on my OS X 10.5.7  
cluster with openMPI (after finding that Xgrid was pretty flaky  
about connections).  I *think* this is an MPI problem (perhaps  
via operator error!)

If I submit openMPI with:
#PBS -l nodes=2:ppn=8
mpirun MyProg
pbs locks off two of the processors, checked via "pbsnodes -a",  
and the job output.  But mpirun runs the whole job on the  
second of the two processors.

If I run the same job w/o qsub (i.e. using ssh)
mpirun -n 16 -host xserve01,xserve02 MyProg
it runs fine on all the nodes
My /var/spool/torque/server_priv/nodes file looks like:
xserve01.local np=8
xserve02.local np=8
Any idea what could be going wrong or how to debug this  
properly? There is nothing suspicious in the server or mom logs.

Thanks for any help,
Jody
--
Jody Klymak
http://web.uvic.ca/~jklymak/

--
Jody Klymak
http://web.uvic.ca/~jklymak/






Re: [OMPI users] torque pbs behaviour...

2009-08-10 Thread Gus Correa

Thank you for the correction, Ralph.
I didn't know there was a (wise) default for the
number of processes when using Torque-enabled OpenMPI.

Gus Correa

Ralph Castain wrote:

Just to correct something said here.


You need to tell mpirun how many processes to launch,
regardless of whether you are using Torque or not.


This is not correct. If you don't tell mpirun how many processes to 
launch, we will automatically launch one process for every slot in your 
allocation. In the case described here, there were 16 slots allocated, 
so we would automatically launch 16 processes.


Ralph



On Aug 10, 2009, at 3:47 PM, Gus Correa wrote:


Hi Jody, list

See comments inline.

Jody Klymak wrote:

On Aug 10, 2009, at  13:01 PM, Gus Correa wrote:

Hi Jody

We don't have Mac OS-X, but Linux, not sure if this applies to you.

Did you configure your OpenMPI with Torque support,
and pointed to the same library that provides the
Torque you are using (--with-tm=/path/to/torque-library-directory)?

Not explicitly. I'll check into that



1) If you don't do it explicitly, configure will use the first libtorque
it finds (and that works I presume),
which may/may not be the one you want, if you have more than one.
If you only have one version of Torque installed,
this shouldn't be the problem.

2) Have you tried something very simple, like the examples/hello_c.c
program, to test the Torque-OpenMPI integration?

3) Also, just in case, put a "cat $PBS_NODEFILE" inside your script,
before mpirun, to see what it reports.
For  "#PBS -l nodes=2:ppn=8"
it should show 16 lines: each node name repeated 8 times.

4) Finally, just to make sure the syntax is right.
On your message you wrote:

>>> If I submit openMPI with:
>>> #PBS -l nodes=2:ppn=8
>>> mpirun MyProg

Is this the real syntax you used?

Or was it perhaps:

#PBS -l nodes=2:ppn=8
mpirun -n 16 MyProg

You need to tell mpirun how many processes to launch,
regardless of whether you are using Torque or not.

My $0.02

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-



Are you using the right mpirun? (There are so many out there.)

yeah - I use the explicit path and moved the OS X one.
Thanks!  Jody

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Jody Klymak wrote:

Hi All,
I've been trying to get torque pbs to work on my OS X 10.5.7 
cluster with openMPI (after finding that Xgrid was pretty flaky 
about connections).  I *think* this is an MPI problem (perhaps via 
operator error!)

If I submit openMPI with:
#PBS -l nodes=2:ppn=8
mpirun MyProg
pbs locks off the two nodes, checked via "pbsnodes -a" and
the job output.  But mpirun runs the whole job on the second of the
two nodes.

If I run the same job w/o qsub (i.e. using ssh)
mpirun -n 16 -host xserve01,xserve02 MyProg
it runs fine on all the nodes.
My /var/spool/torque/server_priv/nodes file looks like:
xserve01.local np=8
xserve02.local np=8
Any idea what could be going wrong or how to debug this properly?
There is nothing suspicious in the server or mom logs.

Thanks for any help,
Jody
--
Jody Klymak
http://web.uvic.ca/~jklymak/
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] torque pbs behaviour...

2009-08-10 Thread Ralph Castain

No problem - yes indeed, 1.1.x would be a bad choice :-)

On Aug 10, 2009, at 3:58 PM, Jody Klymak wrote:



On Aug 10, 2009, at  14:39 PM, Ralph Castain wrote:


mpirun --display-allocation -pernode --display-map hostname



Ummm, hmm, this is embarrassing, none of those command line arguments
worked, making me suspicious...


It looks like somehow I decided to build and run openMPI 1.1.5, or  
at least

mpirun --version tells me that is the mpirun version.

I'll get back to you when I get time to rebuild with 1.3.3.  Could  
be that this is the source of my xgrid problems as well.


Sorry for the noise.  I'll get back to you if I still have problems...

Thanks,  Jody

--
Jody Klymak
http://web.uvic.ca/~jklymak/




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] torque pbs behaviour...

2009-08-10 Thread Jody Klymak


On Aug 10, 2009, at  14:39 PM, Ralph Castain wrote:


mpirun --display-allocation -pernode --display-map hostname



Ummm, hmm, this is embarrassing, none of those command line arguments
worked, making me suspicious...


It looks like somehow I decided to build and run openMPI 1.1.5, or at  
least

mpirun --version tells me that is the mpirun version.

I'll get back to you when I get time to rebuild with 1.3.3.  Could be  
that this is the source of my xgrid problems as well.


Sorry for the noise.  I'll get back to you if I still have problems...

Thanks,  Jody

--
Jody Klymak
http://web.uvic.ca/~jklymak/






Re: [OMPI users] torque pbs behaviour...

2009-08-10 Thread Gus Correa

Hi Jody, list

See comments inline.

Jody Klymak wrote:


On Aug 10, 2009, at  13:01 PM, Gus Correa wrote:


Hi Jody

We don't have Mac OS-X, but Linux, not sure if this applies to you.

Did you configure your OpenMPI with Torque support,
and pointed to the same library that provides the
Torque you are using (--with-tm=/path/to/torque-library-directory)?


Not explicitly. I'll check into that



1) If you don't do it explicitly, configure will use the first libtorque
it finds (and that works I presume),
which may/may not be the one you want, if you have more than one.
If you only have one version of Torque installed,
this shouldn't be the problem.

2) Have you tried something very simple, like the examples/hello_c.c
program, to test the Torque-OpenMPI integration?

3) Also, just in case, put a "cat $PBS_NODEFILE" inside your script,
before mpirun, to see what it reports.
For  "#PBS -l nodes=2:ppn=8"
it should show 16 lines: each node name repeated 8 times (see the sketch
after this list).

4) Finally, just to make sure the syntax is right.
On your message you wrote:

>>> If I submit openMPI with:
>>> #PBS -l nodes=2:ppn=8
>>> mpirun MyProg

Is this the real syntax you used?

Or was it perhaps:

#PBS -l nodes=2:ppn=8
mpirun -n 16 MyProg

You need to tell mpirun how many processes to launch,
regardless of whether you are using Torque or not.
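
Putting 2) and 3) together, a minimal test script would be something along
these lines (a sketch only -- the install prefix is a guess, so adjust the
paths to wherever your OpenMPI and the compiled hello_c actually live):

#PBS -l nodes=2:ppn=8
#PBS -N hello-test
cd $PBS_O_WORKDIR
# item 3: show what Torque allocated; expect 16 lines, 8 per node
cat $PBS_NODEFILE
# item 2: run the hello_c example shipped with OpenMPI, compiled
# beforehand with something like: mpicc examples/hello_c.c -o hello_c
/usr/local/openmpi/bin/mpirun -n 16 ./hello_c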

My $0.02

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-






Are you using the right mpirun? (There are so many out there.)


yeah - I use the explicit path and moved the OS X one.

Thanks!  Jody


Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Jody Klymak wrote:

Hi All,
I've been trying to get torque pbs to work on my OS X 10.5.7 cluster 
with openMPI (after finding that Xgrid was pretty flaky about 
connections).  I *think* this is an MPI problem (perhaps via operator 
error!)

If I submit openMPI with:
#PBS -l nodes=2:ppn=8
mpirun MyProg
pbs locks off the two nodes, checked via "pbsnodes -a" and
the job output.  But mpirun runs the whole job on the second of the
two nodes.

If I run the same job w/o qsub (i.e. using ssh)
mpirun -n 16 -host xserve01,xserve02 MyProg
it runs fine on all the nodes.
My /var/spool/torque/server_priv/nodes file looks like:
xserve01.local np=8
xserve02.local np=8
Any idea what could be going wrong or how to debug this properly?
There is nothing suspicious in the server or mom logs.

Thanks for any help,
Jody
--
Jody Klymak
http://web.uvic.ca/~jklymak/
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] torque pbs behaviour...

2009-08-10 Thread Ralph Castain


On Aug 10, 2009, at 3:25 PM, Jody Klymak wrote:


Hi Ralph,

On Aug 10, 2009, at  13:04 PM, Ralph Castain wrote:


Umm...are you saying that your $PBS_NODEFILE contains the following:


No, if I put cat $PBS_NODEFILE in the pbs script I get
xserve02.local
...
xserve02.local
xserve01.local
...
xserve01.local

each repeated 8 times.  So that seems to be working


Good!





xserve01.local np=8
xserve02.local np=8


If so, that could be part of the problem - it isn't the standard  
notation we are expecting to see in that file. What Torque normally  
provides is one line for each slot, so we would expect to see  
"xserve01.local" repeated 8 times, followed by "xserve02.local"  
repeated 8 times. Given the different syntax, we may not be parsing  
the file correctly. How was this file created?


The file I am referring to above is the $TORQUEHOME/server_priv/nodes
file, which I created by hand based on my understanding of the docs
at:


http://www.clusterresources.com/torquedocs/nodeconfig.shtml


OMPI doesn't care about that file - only Torque looks at it.





Also, could you clarify what node mpirun is executing on?


mpirun seems to only run on xserve02

The job I'm running is just creating a file:

#!/bin/bash

H=`hostname`
echo $H
sleep 10
uptime >&  $H.txt

In the stdout, the echo $H returns
"xserve02.local" 16 times and only xsever02.local.txt gets created...

Again, if I run with "ssh" outside of pbs I get the expected response.


Try running:

mpirun --display-allocation -pernode --display-map hostname

This will tell us what OMPI is seeing in terms of the nodes available  
to it. Based on what you show above, it should see both of your nodes.  
By forcing OMPI to put one proc/node, you'll be directing it to use  
both nodes for the job. You should see this in the reported map.
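
Inside your PBS script that would be simply (a sketch -- the full path is
only there to make sure the Torque-aware mpirun is the one being used):

#PBS -l nodes=2:ppn=8
cd $PBS_O_WORKDIR
/usr/local/openmpi/bin/mpirun --display-allocation -pernode --display-map hostname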


If we then see both procs run on the same node, I would suggest  
reconfiguring OMPI with --enable-debug, and then rerunning the above  
command with an additional flag:


-mca plm_base_verbose 5

which will show us all the ugly details of what OMPI is telling Torque  
to do. Since OMPI works fine with Torque on Linux, my guess is that  
there is something about the Torque for Mac that is a little different  
and thus causing problems.
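
If it comes to that, the rebuild and the rerun would look roughly like this
(a sketch -- the prefix and the --with-tm path are only examples; point them
at your actual install locations):

./configure --prefix=/usr/local/openmpi --with-tm=/usr/local --enable-debug
make all install
/usr/local/openmpi/bin/mpirun --display-allocation -pernode --display-map \
    -mca plm_base_verbose 5 hostname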


Ralph





Thanks,  Jody





Ralph

On Aug 10, 2009, at 1:43 PM, Jody Klymak wrote:



Hi All,

I've been trying to get torque pbs to work on my OS X 10.5.7  
cluster with openMPI (after finding that Xgrid was pretty flaky  
about connections).  I *think* this is an MPI problem (perhaps via  
operator error!)


If I submit openMPI with:


#PBS -l nodes=2:ppn=8

mpirun MyProg


pbs locks off the two nodes, checked via "pbsnodes -a"
and the job output.  But mpirun runs the whole job on the second
of the two nodes.


If I run the same job w/o qsub (i.e. using ssh)
mpirun -n 16 -host xserve01,xserve02 MyProg
it runs fine on all the nodes.

My /var/spool/torque/server_priv/nodes file looks like:

xserve01.local np=8
xserve02.local np=8


Any idea what could be going wrong or how to debug this properly?
There is nothing suspicious in the server or mom logs.


Thanks for any help,

Jody





--
Jody Klymak
http://web.uvic.ca/~jklymak/




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] torque pbs behaviour...

2009-08-10 Thread Jody Klymak

Hi Ralph,

On Aug 10, 2009, at  13:04 PM, Ralph Castain wrote:


Umm...are you saying that your $PBS_NODEFILE contains the following:


No, if I put cat $PBS_NODEFILE in the pbs script I get
xserve02.local
...
xserve02.local
xserve01.local
...
xserve01.local

each repeated 8 times.  So that seems to be working



xserve01.local np=8
xserve02.local np=8


If so, that could be part of the problem - it isn't the standard  
notation we are expecting to see in that file. What Torque normally  
provides is one line for each slot, so we would expect to see  
"xserve01.local" repeated 8 times, followed by "xserve02.local"  
repeated 8 times. Given the different syntax, we may not be parsing  
the file correctly. How was this file created?


The file I am referring to above is the $TORQUEHOME/server_priv/nodes
file, which I created by hand based on my understanding of the docs
at:


http://www.clusterresources.com/torquedocs/nodeconfig.shtml



Also, could you clarify what node mpirun is executing on?


mpirun seems to only run on xserve02

The job I'm running is just creating a file:

#!/bin/bash

H=`hostname`
echo $H
sleep 10
uptime >&  $H.txt

In the stdout, the echo $H returns
"xserve02.local" 16 times and only xsever02.local.txt gets created...

Again, if I run with "ssh" outside of pbs I get the expected response.


Thanks,  Jody





Ralph

On Aug 10, 2009, at 1:43 PM, Jody Klymak wrote:



Hi All,

I've been trying to get torque pbs to work on my OS X 10.5.7  
cluster with openMPI (after finding that Xgrid was pretty flaky  
about connections).  I *think* this is an MPI problem (perhaps via  
operator error!)


If I submit openMPI with:


#PBS -l nodes=2:ppn=8

mpirun MyProg


pbs locks off the two nodes, checked via "pbsnodes -a" and
the job output.  But mpirun runs the whole job on the second of the
two nodes.


If I run the same job w/o qsub (i.e. using ssh)
mpirun -n 16 -host xserve01,xserve02 MyProg
it runs fine on all the nodes.

My /var/spool/torque/server_priv/nodes file looks like:

xserve01.local np=8
xserve02.local np=8


Any idea what could be going wrong or how to debug this properly?
There is nothing suspicious in the server or mom logs.


Thanks for any help,

Jody





--
Jody Klymak
http://web.uvic.ca/~jklymak/




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jody Klymak
http://web.uvic.ca/~jklymak/






Re: [OMPI users] torque pbs behaviour...

2009-08-10 Thread Jody Klymak


On Aug 10, 2009, at  13:01 PM, Gus Correa wrote:


Hi Jody

We don't have Mac OS-X, but Linux, not sure if this applies to you.

Did you configure your OpenMPI with Torque support,
and pointed to the same library that provides the
Torque you are using (--with-tm=/path/to/torque-library-directory)?


Not explicitly. I'll check into that
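
(One quick way to check, I suppose, assuming ompi_info comes from the same
install as the mpirun I'm using:

/usr/local/openmpi/bin/ompi_info | grep -i tm

A Torque-enabled build should list tm components, e.g. for the plm and ras
frameworks; if nothing shows up, the build did not pick up Torque support.)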



Are you using the right mpirun? (There are so many out there.)


yeah - I use the explicit path and moved the OS X one.

Thanks!  Jody


Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Jody Klymak wrote:

Hi All,
I've been trying to get torque pbs to work on my OS X 10.5.7  
cluster with openMPI (after finding that Xgrid was pretty flaky  
about connections).  I *think* this is an MPI problem (perhaps via  
operator error!)

If I submit openMPI with:
#PBS -l nodes=2:ppn=8
mpirun MyProg
pbs locks off the two nodes, checked via "pbsnodes -a" and
the job output.  But mpirun runs the whole job on the second of the
two nodes.

If I run the same job w/o qsub (i.e. using ssh)
mpirun -n 16 -host xserve01,xserve02 MyProg
it runs fine on all the nodes.
My /var/spool/torque/server_priv/nodes file looks like:
xserve01.local np=8
xserve02.local np=8
Any idea what could be going wrong or how to debug this properly?
There is nothing suspicious in the server or mom logs.

Thanks for any help,
Jody
--
Jody Klymak
http://web.uvic.ca/~jklymak/
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jody Klymak
http://web.uvic.ca/~jklymak/






Re: [OMPI users] torque pbs behaviour...

2009-08-10 Thread Ralph Castain

Umm...are you saying that your $PBS_NODEFILE contains the following:


xserve01.local np=8
xserve02.local np=8


If so, that could be part of the problem - it isn't the standard  
notation we are expecting to see in that file. What Torque normally  
provides is one line for each slot, so we would expect to see  
"xserve01.local" repeated 8 times, followed by "xserve02.local"  
repeated 8 times. Given the different syntax, we may not be parsing  
the file correctly. How was this file created?


Also, could you clarify what node mpirun is executing on?

Ralph

On Aug 10, 2009, at 1:43 PM, Jody Klymak wrote:



Hi All,

I've been trying to get torque pbs to work on my OS X 10.5.7 cluster  
with openMPI (after finding that Xgrid was pretty flaky about  
connections).  I *think* this is an MPI problem (perhaps via  
operator error!)


If I submit openMPI with:


#PBS -l nodes=2:ppn=8

mpirun MyProg


pbs locks off the two nodes, checked via "pbsnodes -a" and
the job output.  But mpirun runs the whole job on the second of the
two nodes.


If I run the same job w/o qsub (i.e. using ssh)
mpirun -n 16 -host xserve01,xserve02 MyProg
it runs fine on all the nodes.

My /var/spool/torque/server_priv/nodes file looks like:

xserve01.local np=8
xserve02.local np=8


Any idea what could be going wrong or how to debug this properly?
There is nothing suspicious in the server or mom logs.


Thanks for any help,

Jody





--
Jody Klymak
http://web.uvic.ca/~jklymak/




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users