Re: [OMPI users] torque pbs behaviour...
Well, it now is launching just fine, so that's one thing! :-) Afraid I'll have to let the TCP btl guys take over from here. It looks like everything is up and running, but something strange is going on in the MPI comm layer. You can turn off those mca params I gave you, as you are now past that point. I know there are others that can help debug that TCP btl error - they can help you there.

Ralph

On Tue, Aug 11, 2009 at 8:54 AM, Klymak Jody wrote:
>
> On 11-Aug-09, at 6:16 AM, Jeff Squyres wrote:
>
>> This means that OMPI is finding an mca_iof_proxy.la file at run time from
>> a prior version of Open MPI. You might want to use "find" or "locate" to
>> search your nodes and find it. I suspect that you somehow have an OMPI
>> 1.3.x install that overlaid an install of a prior OMPI version.
>
> OK, right you were - the old file was in my new install directory. I
> didn't erase /usr/local/openmpi before re-running the install...
>
> However, after reinstalling on the nodes (but not cleaning out /usr/lib on
> all the nodes) I still have the following:
>
> Thanks, Jody
>
> [saturna.cluster:17660] mca:base:select:( plm) Querying component [rsh]
> [saturna.cluster:17660] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [saturna.cluster:17660] mca:base:select:( plm) Querying component [slurm]
> [saturna.cluster:17660] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [saturna.cluster:17660] mca:base:select:( plm) Querying component [tm]
> [saturna.cluster:17660] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module
> [saturna.cluster:17660] mca:base:select:( plm) Querying component [xgrid]
> [saturna.cluster:17660] mca:base:select:( plm) Skipping component [xgrid]. Query failed to return a module
> [saturna.cluster:17660] mca:base:select:( plm) Selected component [rsh]
> [saturna.cluster:17660] plm:base:set_hnp_name: initial bias 17660 nodename hash 1656374957
> [saturna.cluster:17660] plm:base:set_hnp_name: final jobfam 24811
> [saturna.cluster:17660] [[24811,0],0] plm:base:receive start comm
> [saturna.cluster:17660] mca:base:select:( odls) Querying component [default]
> [saturna.cluster:17660] mca:base:select:( odls) Query of component [default] set priority to 1
> [saturna.cluster:17660] mca:base:select:( odls) Selected component [default]
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: setting up job [24811,1]
> [saturna.cluster:17660] [[24811,0],0] plm:base:setup_job for job [24811,1]
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: local shell: 0 (bash)
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: assuming same remote shell as local shell
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: remote shell: 0 (bash)
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: final template argv: /usr/bin/ssh PATH=/usr/local/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /usr/local/openmpi/bin/orted --debug-daemons -mca ess env -mca orte_ess_jobid 1626013696 -mca orte_ess_vpid -mca orte_ess_num_procs 3 --hnp-uri "1626013696.0;tcp://142.104.154.96:49710;tcp://192.168.2.254:49710" -mca plm_base_verbose 5 -mca odls_base_verbose 5
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: launching on node xserve01
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: recording launch of daemon [[24811,0],1]
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh xserve01 PATH=/usr/local/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /usr/local/openmpi/bin/orted --debug-daemons -mca ess env -mca orte_ess_jobid 1626013696 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "1626013696.0;tcp://142.104.154.96:49710;tcp://192.168.2.254:49710" -mca plm_base_verbose 5 -mca odls_base_verbose 5]
> Daemon was launched on xserve01.cluster - beginning to initialize
> [xserve01.cluster:42519] mca:base:select:( odls) Querying component [default]
> [xserve01.cluster:42519] mca:base:select:( odls) Query of component [default] set priority to 1
> [xserve01.cluster:42519] mca:base:select:( odls) Selected component [default]
> Daemon [[24811,0],1] checking in as pid 42519 on host xserve01.cluster
> Daemon [[24811,0],1] not using static ports
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: launching on node xserve02
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: recording launch of daemon [[24811,0],2]
> [saturna.cluster:17660] [[24811,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh xserve02 PATH=/usr/local/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /usr/local/openmpi/bin/orted --debug-daemons -mca ess env -mca orte_ess_jobid 1626013696 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3
Re: [OMPI users] torque pbs behaviour...
Yeah, it's the lib confusion that's the problem - this is it:

> [saturna.cluster:07360] [[14551,0],0] ORTE_ERROR_LOG: Buffer type (described
> vs non-described) mismatch - operation not allowed in file
> base/odls_base_default_fns.c at line 2475

Have you tried configuring with --enable-mpirun-prefix-by-default? That would help avoid the confusion. You should also check your path to ensure that it is correct (make sure that mpirun is the one you expect, and that you are getting the corresponding remote orted).

Ralph

On Tue, Aug 11, 2009 at 8:23 AM, Klymak Jody wrote:
>
> On 11-Aug-09, at 7:03 AM, Ralph Castain wrote:
>
>> Sigh - too early in the morning for this old brain, I fear...
>>
>> You are right - the ranks are fine, and local rank doesn't matter. It
>> sounds like a problem where the TCP messaging is getting a message ack'd
>> from someone other than the process that was supposed to recv the message.
>> This should cause us to abort, but we were just talking on the phone that
>> the abort procedure may not be working correctly. Or it could be (as Jeff
>> suggests) that the version mismatch is also preventing us from properly
>> aborting too.
>>
>> So I fear we are back to trying to find these other versions on your
>> nodes...
>
> Well, the old version is still on the nodes (in /usr/lib as default for OS
> X)...
>
> I can try and clean those all out by hand, but I'm still confused why the
> old version would be used - how does openMPI find the right library?
>
> Note again, that I get these MCA warnings on the server when just running
> ompi_info, and I *have* cleaned out /usr/lib on the server. So I really
> don't understand how on the server I can still have a library issue. Is
> there a way to trace at runtime what library an executable is dynamically
> linking to? Can I rebuild openmpi statically?
>
> Thanks, Jody
>
> On Tue, Aug 11, 2009 at 7:43 AM, Klymak Jody wrote:
>>
>> On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:
>>
>>> The reason your job is hanging is sitting in the orte-ps output. You have
>>> multiple processes declaring themselves to be the same MPI rank. That
>>> definitely won't work.
>>
>> Its the "local rank" if that makes any difference...
>>
>> Any thoughts on this output?
>>
>> [xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack]
>> received unexpected process identifier [[61029,1],3]
>>
>>> The question is why is that happening? We use Torque all the time, so we
>>> know that the basic support is correct. It -could- be related to lib
>>> confusion, but I can't tell for sure.
>>
>> Just to be clear, this is not going through torque at this point. Its
>> just vanilla ssh, for which this code worked with 1.1.5.
>>
>>> Can you rebuild OMPI with --enable-debug, and rerun the job with the
>>> following added to your cmd line?
>>>
>>> -mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5
>>
>> Working on this...
>>
>> Thanks, Jody
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] torque pbs behaviour...
On 11-Aug-09, at 6:16 AM, Jeff Squyres wrote:

> This means that OMPI is finding an mca_iof_proxy.la file at run time from
> a prior version of Open MPI. You might want to use "find" or "locate" to
> search your nodes and find it. I suspect that you somehow have an OMPI
> 1.3.x install that overlaid an install of a prior OMPI version.

OK, right you were - the old file was in my new install directory. I didn't erase /usr/local/openmpi before re-running the install...

However, after reinstalling on the nodes (but not cleaning out /usr/lib on all the nodes) I still have the following:

Thanks, Jody

[saturna.cluster:17660] mca:base:select:( plm) Querying component [rsh]
[saturna.cluster:17660] mca:base:select:( plm) Query of component [rsh] set priority to 10
[saturna.cluster:17660] mca:base:select:( plm) Querying component [slurm]
[saturna.cluster:17660] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[saturna.cluster:17660] mca:base:select:( plm) Querying component [tm]
[saturna.cluster:17660] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module
[saturna.cluster:17660] mca:base:select:( plm) Querying component [xgrid]
[saturna.cluster:17660] mca:base:select:( plm) Skipping component [xgrid]. Query failed to return a module
[saturna.cluster:17660] mca:base:select:( plm) Selected component [rsh]
[saturna.cluster:17660] plm:base:set_hnp_name: initial bias 17660 nodename hash 1656374957
[saturna.cluster:17660] plm:base:set_hnp_name: final jobfam 24811
[saturna.cluster:17660] [[24811,0],0] plm:base:receive start comm
[saturna.cluster:17660] mca:base:select:( odls) Querying component [default]
[saturna.cluster:17660] mca:base:select:( odls) Query of component [default] set priority to 1
[saturna.cluster:17660] mca:base:select:( odls) Selected component [default]
[saturna.cluster:17660] [[24811,0],0] plm:rsh: setting up job [24811,1]
[saturna.cluster:17660] [[24811,0],0] plm:base:setup_job for job [24811,1]
[saturna.cluster:17660] [[24811,0],0] plm:rsh: local shell: 0 (bash)
[saturna.cluster:17660] [[24811,0],0] plm:rsh: assuming same remote shell as local shell
[saturna.cluster:17660] [[24811,0],0] plm:rsh: remote shell: 0 (bash)
[saturna.cluster:17660] [[24811,0],0] plm:rsh: final template argv: /usr/bin/ssh PATH=/usr/local/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /usr/local/openmpi/bin/orted --debug-daemons -mca ess env -mca orte_ess_jobid 1626013696 -mca orte_ess_vpid -mca orte_ess_num_procs 3 --hnp-uri "1626013696.0;tcp://142.104.154.96:49710;tcp://192.168.2.254:49710" -mca plm_base_verbose 5 -mca odls_base_verbose 5
[saturna.cluster:17660] [[24811,0],0] plm:rsh: launching on node xserve01
[saturna.cluster:17660] [[24811,0],0] plm:rsh: recording launch of daemon [[24811,0],1]
[saturna.cluster:17660] [[24811,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh xserve01 PATH=/usr/local/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /usr/local/openmpi/bin/orted --debug-daemons -mca ess env -mca orte_ess_jobid 1626013696 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "1626013696.0;tcp://142.104.154.96:49710;tcp://192.168.2.254:49710" -mca plm_base_verbose 5 -mca odls_base_verbose 5]
Daemon was launched on xserve01.cluster - beginning to initialize
[xserve01.cluster:42519] mca:base:select:( odls) Querying component [default]
[xserve01.cluster:42519] mca:base:select:( odls) Query of component [default] set priority to 1
[xserve01.cluster:42519] mca:base:select:( odls) Selected component [default]
Daemon [[24811,0],1] checking in as pid 42519 on host xserve01.cluster
Daemon [[24811,0],1] not using static ports
[saturna.cluster:17660] [[24811,0],0] plm:rsh: launching on node xserve02
[saturna.cluster:17660] [[24811,0],0] plm:rsh: recording launch of daemon [[24811,0],2]
[saturna.cluster:17660] [[24811,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh xserve02 PATH=/usr/local/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /usr/local/openmpi/bin/orted --debug-daemons -mca ess env -mca orte_ess_jobid 1626013696 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "1626013696.0;tcp://142.104.154.96:49710;tcp://192.168.2.254:49710" -mca plm_base_verbose 5 -mca odls_base_verbose 5]
Daemon was launched on xserve02.local - beginning to initialize
[xserve02.local:42180] mca:base:select:( odls) Querying component [default]
[xserve02.local:42180] mca:base:select:( odls) Query of component [default] set priority to 1
[xserve02.local:42180] mca:base:select:( odls) Selected component [default]
Daemon [[24811,0],2] checking in as pid 42180 on host xserve02.local
Daemon [[24811,0],2] not using static ports
Re: [OMPI users] torque pbs behaviour...
On 11-Aug-09, at 7:03 AM, Ralph Castain wrote:

> Sigh - too early in the morning for this old brain, I fear...
>
> You are right - the ranks are fine, and local rank doesn't matter. It
> sounds like a problem where the TCP messaging is getting a message ack'd
> from someone other than the process that was supposed to recv the message.
> This should cause us to abort, but we were just talking on the phone that
> the abort procedure may not be working correctly. Or it could be (as Jeff
> suggests) that the version mismatch is also preventing us from properly
> aborting too.
>
> So I fear we are back to trying to find these other versions on your
> nodes...

Well, the old version is still on the nodes (in /usr/lib as default for OS X)...

I can try and clean those all out by hand, but I'm still confused why the old version would be used - how does openMPI find the right library?

Note again, that I get these MCA warnings on the server when just running ompi_info, and I *have* cleaned out /usr/lib on the server. So I really don't understand how on the server I can still have a library issue. Is there a way to trace at runtime what library an executable is dynamically linking to? Can I rebuild openmpi statically?

Thanks, Jody

On Tue, Aug 11, 2009 at 7:43 AM, Klymak Jody wrote:
>
> On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:
>
>> The reason your job is hanging is sitting in the orte-ps output. You have
>> multiple processes declaring themselves to be the same MPI rank. That
>> definitely won't work.
>
> Its the "local rank" if that makes any difference...
>
> Any thoughts on this output?
>
> [xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack]
> received unexpected process identifier [[61029,1],3]
>
>> The question is why is that happening? We use Torque all the time, so we
>> know that the basic support is correct. It -could- be related to lib
>> confusion, but I can't tell for sure.
>
> Just to be clear, this is not going through torque at this point. Its
> just vanilla ssh, for which this code worked with 1.1.5.
>
>> Can you rebuild OMPI with --enable-debug, and rerun the job with the
>> following added to your cmd line?
>>
>> -mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5
>
> Working on this...
>
> Thanks, Jody
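Jody's question above - how to see which library an executable actually links against at run time - can be answered from the command line. A minimal sketch, assuming a POSIX shell; /bin/sh is used here only so it runs anywhere, and in this thread you would point it at /usr/local/openmpi/bin/orted or the mitgcmuv binary instead:

```shell
# Show which shared libraries a binary is linked against. On OS X this is
# otool -L; on Linux the equivalent is ldd. Substitute the binary you care
# about (e.g. /usr/local/openmpi/bin/orted) for /bin/sh.
BIN=/bin/sh
if command -v otool >/dev/null 2>&1; then
    otool -L "$BIN"    # OS X: list the dylibs recorded in the binary
else
    ldd "$BIN"         # Linux equivalent
fi
```

On OS X, setting DYLD_PRINT_LIBRARIES=1 in the environment before launching the program additionally makes the dynamic loader print every library it actually opens, which catches cases where a stale install directory wins at load time even though the link-time paths look right.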
Re: [OMPI users] torque pbs behaviour...
On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:

> -mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5
>
> I'm afraid the output will be a tad verbose, but I would appreciate
> seeing it. Might also tell us something about the lib issue.

Command line was:

/usr/local/openmpi/bin/mpirun -mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5 -n 16 --host xserve03,xserve04 ../build/mitgcmuv

Starting: ../results//TasGaussRestart16
[saturna.cluster:07360] mca:base:select:( plm) Querying component [rsh]
[saturna.cluster:07360] mca:base:select:( plm) Query of component [rsh] set priority to 10
[saturna.cluster:07360] mca:base:select:( plm) Querying component [slurm]
[saturna.cluster:07360] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[saturna.cluster:07360] mca:base:select:( plm) Querying component [tm]
[saturna.cluster:07360] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module
[saturna.cluster:07360] mca:base:select:( plm) Querying component [xgrid]
[saturna.cluster:07360] mca:base:select:( plm) Skipping component [xgrid]. Query failed to return a module
[saturna.cluster:07360] mca:base:select:( plm) Selected component [rsh]
[saturna.cluster:07360] plm:base:set_hnp_name: initial bias 7360 nodename hash 1656374957
[saturna.cluster:07360] plm:base:set_hnp_name: final jobfam 14551
[saturna.cluster:07360] [[14551,0],0] plm:base:receive start comm
[saturna.cluster:07360] mca: base: component_find: ras "mca_ras_dash_host" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca: base: component_find: ras "mca_ras_hostfile" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca: base: component_find: ras "mca_ras_localhost" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca: base: component_find: ras "mca_ras_xgrid" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca:base:select:( odls) Querying component [default]
[saturna.cluster:07360] mca:base:select:( odls) Query of component [default] set priority to 1
[saturna.cluster:07360] mca:base:select:( odls) Selected component [default]
[saturna.cluster:07360] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] [[14551,0],0] plm:rsh: setting up job [14551,1]
[saturna.cluster:07360] [[14551,0],0] plm:base:setup_job for job [14551,1]
[saturna.cluster:07360] [[14551,0],0] plm:rsh: local shell: 0 (bash)
[saturna.cluster:07360] [[14551,0],0] plm:rsh: assuming same remote shell as local shell
[saturna.cluster:07360] [[14551,0],0] plm:rsh: remote shell: 0 (bash)
[saturna.cluster:07360] [[14551,0],0] plm:rsh: final template argv: /usr/bin/ssh PATH=/usr/local/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /usr/local/openmpi/bin/orted --debug-daemons -mca ess env -mca orte_ess_jobid 953614336 -mca orte_ess_vpid -mca orte_ess_num_procs 3 --hnp-uri "953614336.0;tcp://142.104.154.96:49622;tcp://192.168.2.254:49622" -mca plm_base_verbose 5 -mca odls_base_verbose 5
[saturna.cluster:07360] [[14551,0],0] plm:rsh: launching on node xserve03
[saturna.cluster:07360] [[14551,0],0] plm:rsh: recording launch of daemon [[14551,0],1]
[saturna.cluster:07360] [[14551,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh xserve03 PATH=/usr/local/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /usr/local/openmpi/bin/orted --debug-daemons -mca ess env -mca orte_ess_jobid 953614336 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "953614336.0;tcp://142.104.154.96:49622;tcp://192.168.2.254:49622" -mca plm_base_verbose 5 -mca odls_base_verbose 5]
Daemon was launched on xserve03.local - beginning to initialize
[xserve03.local:40708] mca:base:select:( odls) Querying component [default]
[xserve03.local:40708] mca:base:select:( odls) Query of component [default] set priority to 1
[xserve03.local:40708] mca:base:select:( odls) Selected component [default]
[xserve03.local:40708] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve03.local:40708] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 !=
Re: [OMPI users] torque pbs behaviour...
Sigh - too early in the morning for this old brain, I fear...

You are right - the ranks are fine, and local rank doesn't matter. It sounds like a problem where the TCP messaging is getting a message ack'd from someone other than the process that was supposed to recv the message. This should cause us to abort, but we were just talking on the phone that the abort procedure may not be working correctly. Or it could be (as Jeff suggests) that the version mismatch is also preventing us from properly aborting too.

So I fear we are back to trying to find these other versions on your nodes...

On Tue, Aug 11, 2009 at 7:43 AM, Klymak Jody wrote:
>
> On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:
>
>> The reason your job is hanging is sitting in the orte-ps output. You have
>> multiple processes declaring themselves to be the same MPI rank. That
>> definitely won't work.
>
> Its the "local rank" if that makes any difference...
>
> Any thoughts on this output?
>
> [xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack]
> received unexpected process identifier [[61029,1],3]
>
>> The question is why is that happening? We use Torque all the time, so we
>> know that the basic support is correct. It -could- be related to lib
>> confusion, but I can't tell for sure.
>
> Just to be clear, this is not going through torque at this point. Its just
> vanilla ssh, for which this code worked with 1.1.5.
>
>> Can you rebuild OMPI with --enable-debug, and rerun the job with the
>> following added to your cmd line?
>>
>> -mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5
>
> Working on this...
>
> Thanks, Jody
Re: [OMPI users] torque pbs behaviour...
On Aug 11, 2009, at 9:43 AM, Klymak Jody wrote:

> [xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack]
> received unexpected process identifier [[61029,1],3]

This could well be caused by a version mismatch between your nodes. E.g., if one node is running OMPI vx.y.z and another is running va.b.c. We don't check for version mismatch in network communications, and our wire protocols have changed between versions. So if vx.y.z sends something that is not understood by va.b.c, something like the above message could occur.

--
Jeff Squyres
jsquy...@cisco.com
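Jeff's point suggests an easy sanity check before digging deeper: confirm that every node reports the same Open MPI version. A hedged sketch in POSIX shell - the host names and ompi_info path are assumptions taken from this thread, ompi_info's plain output includes an "Open MPI:" version line, and run_on is just a hypothetical ssh wrapper:

```shell
# Hypothetical helper: run ompi_info on a remote node (path assumed from
# this thread; in real use this contacts the node over ssh).
run_on() { ssh "$1" /usr/local/openmpi/bin/ompi_info; }

# Compare the "Open MPI:" version line across all given hosts and report
# the first mismatch, if any.
check_versions() {
    first=""
    for h in "$@"; do
        v=$(run_on "$h" | grep 'Open MPI:' | head -n 1)
        if [ -z "$first" ]; then first="$v"; fi
        if [ "$v" != "$first" ]; then
            echo "version mismatch on $h: $v (expected $first)"
            return 1
        fi
    done
    echo "all nodes report: $first"
}

# Example (would contact the nodes named in this thread):
# check_versions xserve03 xserve04
```

A mismatch reported here would explain the unexpected-process-identifier error without any further debugging of the TCP btl itself.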
Re: [OMPI users] torque pbs behaviour...
On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:

> The reason your job is hanging is sitting in the orte-ps output. You have
> multiple processes declaring themselves to be the same MPI rank. That
> definitely won't work.

Its the "local rank" if that makes any difference...

Any thoughts on this output?

[xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack]
received unexpected process identifier [[61029,1],3]

> The question is why is that happening? We use Torque all the time, so we
> know that the basic support is correct. It -could- be related to lib
> confusion, but I can't tell for sure.

Just to be clear, this is not going through torque at this point. Its just vanilla ssh, for which this code worked with 1.1.5.

> Can you rebuild OMPI with --enable-debug, and rerun the job with the
> following added to your cmd line?
>
> -mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5

Working on this...

Thanks, Jody
Re: [OMPI users] torque pbs behaviour...
Oops - I should have looked at your output more closely. The component_find warnings are clearly indicating some old libs lying around, but that isn't why your job is hanging.

The reason your job is hanging is sitting in the orte-ps output. You have multiple processes declaring themselves to be the same MPI rank. That definitely won't work.

The question is why is that happening? We use Torque all the time, so we know that the basic support is correct. It -could- be related to lib confusion, but I can't tell for sure.

Can you rebuild OMPI with --enable-debug, and rerun the job with the following added to your cmd line?

-mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5

I'm afraid the output will be a tad verbose, but I would appreciate seeing it. Might also tell us something about the lib issue.

Thanks
Ralph

On Tue, Aug 11, 2009 at 7:22 AM, Ralph Castain wrote:

> Sorry, but Jeff is correct - that error message clearly indicates a version
> mismatch. Somewhere, one or more of your nodes is still picking up an old
> version.
>
> On Tue, Aug 11, 2009 at 7:16 AM, Jeff Squyres wrote:
>
>> On Aug 11, 2009, at 9:11 AM, Klymak Jody wrote:
>>
>>> I have removed all the OS-X -supplied libraries, recompiled and
>>> installed openmpi 1.3.3, and I am *still* getting this warning when
>>> running ompi_info:
>>>
>>> [saturna.cluster:50307] mca: base: component_find: iof "mca_iof_proxy"
>>> uses an MCA interface that is not recognized (component MCA v1.0.0 !=
>>> supported MCA v2.0.0) -- ignored
>>
>> This means that OMPI is finding an mca_iof_proxy.la file at run time from
>> a prior version of Open MPI. You might want to use "find" or "locate" to
>> search your nodes and find it. I suspect that you somehow have an OMPI
>> 1.3.x install that overlaid an install of a prior OMPI version.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
Re: [OMPI users] torque pbs behaviour...
Sorry, but Jeff is correct - that error message clearly indicates a version mismatch. Somewhere, one or more of your nodes is still picking up an old version.

On Tue, Aug 11, 2009 at 7:16 AM, Jeff Squyres wrote:

> On Aug 11, 2009, at 9:11 AM, Klymak Jody wrote:
>
>> I have removed all the OS-X -supplied libraries, recompiled and
>> installed openmpi 1.3.3, and I am *still* getting this warning when
>> running ompi_info:
>>
>> [saturna.cluster:50307] mca: base: component_find: iof "mca_iof_proxy"
>> uses an MCA interface that is not recognized (component MCA v1.0.0 !=
>> supported MCA v2.0.0) -- ignored
>
> This means that OMPI is finding an mca_iof_proxy.la file at run time from
> a prior version of Open MPI. You might want to use "find" or "locate" to
> search your nodes and find it. I suspect that you somehow have an OMPI
> 1.3.x install that overlaid an install of a prior OMPI version.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
Re: [OMPI users] torque pbs behaviour...
On Aug 11, 2009, at 9:11 AM, Klymak Jody wrote:

> I have removed all the OS-X -supplied libraries, recompiled and
> installed openmpi 1.3.3, and I am *still* getting this warning when
> running ompi_info:
>
> [saturna.cluster:50307] mca: base: component_find: iof "mca_iof_proxy"
> uses an MCA interface that is not recognized (component MCA v1.0.0 !=
> supported MCA v2.0.0) -- ignored

This means that OMPI is finding an mca_iof_proxy.la file at run time from a prior version of Open MPI. You might want to use "find" or "locate" to search your nodes and find it. I suspect that you somehow have an OMPI 1.3.x install that overlaid an install of a prior OMPI version.

--
Jeff Squyres
jsquy...@cisco.com
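Jeff's "find or locate" suggestion can be scripted. A minimal runnable sketch - the temporary directory below only simulates a stale install so the command can be tried anywhere; on the actual nodes you would point find at the real prefixes mentioned in this thread (/usr/lib and /usr/local/openmpi/lib):

```shell
# Simulate a stale Open MPI component left behind by an old install.
# (Real use: replace $DEMO with /usr/lib /usr/local/openmpi/lib.)
DEMO=$(mktemp -d)
mkdir -p "$DEMO/lib/openmpi"
touch "$DEMO/lib/openmpi/mca_iof_proxy.la"   # pretend: leftover 1.x component

# Locate any leftover MCA component files under the given prefix.
find "$DEMO" -name 'mca_*.la' -o -name 'mca_*.so'
```

Any hit outside the current install tree is a candidate for the runtime component_find warnings; removing it (or the whole stale prefix before reinstalling) avoids the overlaid-install problem Jeff describes.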
Re: [OMPI users] torque pbs behaviour...
On 10-Aug-09, at 8:03 PM, Ralph Castain wrote:

> Interesting! Well, I always make sure I have my personal OMPI build
> before any system stuff, and I work exclusively on Mac OS-X:

I am still finding this very mysterious. I have removed all the OS-X -supplied libraries, recompiled and installed openmpi 1.3.3, and I am *still* getting this warning when running ompi_info:

[saturna.cluster:50307] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: ras "mca_ras_dash_host" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: ras "mca_ras_hostfile" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: ras "mca_ras_localhost" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: ras "mca_ras_xgrid" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:50307] mca: base: component_find: rcache "mca_rcache_rb" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored

So, I guess I'm not clear how the library can be an issue...

I *do* get another error from running the gcm that I do not get from running simpler jobs - hopefully this helps explain things:

[xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack]
received unexpected process identifier [[61029,1],3]

The processes are running, the mitgcmuv processes are running on the xserves, and using considerable resources! They open STDERR/STDOUT but nothing is flushed into them, including the few print statements I've put in before and after MPI_INIT (as Ralph suggested).

On 11-Aug-09, at 4:17 AM, Ashley Pittman wrote:

> If you suspect a hang then you can use the command orte-ps (on the node
> where the mpirun is running) and it should show you your job. This will
> tell you if the job is started and still running or if there was a
> problem launching.

/usr/local/openmpi/bin/orte-ps
[saturna.cluster:51840] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:51840] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored

Information from mpirun [61029,0]
---------------------------------
     JobID |   State | Slots | Num Procs |
------------------------------------------
 [61029,1] | Running |     2 |        16 |

      Process Name |      ORTE Name | Local Rank |   PID |     Node |   State |
-------------------------------------------------------------------------------
 ../build/mitgcmuv |  [[61029,1],0] |          0 | 40206 | xserve03 | Running |
 ../build/mitgcmuv |  [[61029,1],1] |          0 | 40005 | xserve04 | Running |
 ../build/mitgcmuv |  [[61029,1],2] |          1 | 40207 | xserve03 | Running |
 ../build/mitgcmuv |  [[61029,1],3] |          1 | 40006 | xserve04 | Running |
 ../build/mitgcmuv |  [[61029,1],4] |          2 | 40208 | xserve03 | Running |
 ../build/mitgcmuv |  [[61029,1],5] |          2 | 40007 | xserve04 | Running |
 ../build/mitgcmuv |  [[61029,1],6] |          3 | 40209 | xserve03 | Running |
 ../build/mitgcmuv |  [[61029,1],7] |          3 | 40008 | xserve04 | Running |
 ../build/mitgcmuv |  [[61029,1],8] |          4 | 40210 | xserve03 | Running |
 ../build/mitgcmuv |  [[61029,1],9] |          4 | 40009 | xserve04 | Running |
 ../build/mitgcmuv | [[61029,1],10] |          5 | 40211 | xserve03 | Running |
 ../build/mitgcmuv | [[61029,1],11] |          5 | 40010 | xserve04 | Running |
 ../build/mitgcmuv | [[61029,1],12] |          6 | 40212 | xserve03 | Running |
 ../build/mitgcmuv | [[61029,1],13] |          6 | 40011 | xserve04 | Running |
 ../build/mitgcmuv | [[61029,1],14] |          7 | 40213 | xserve03 | Running |
 ../build/mitgcmuv | [[61029,1],15] |          7 | 40012 | xserve04 | Running |

Thanks, Jody
Re: [OMPI users] torque pbs behaviour...
On Aug 11, 2009, at 5:17 AM, Ashley Pittman wrote:

On Tue, 2009-08-11 at 03:03 -0600, Ralph Castain wrote: If it isn't already there, try putting a print statement right at program start, another just prior to MPI_Init, and another just after MPI_Init. It could be that something is hanging somewhere during program startup since it sounds like everything is launching just fine.

If you suspect a hang then you can use the command orte-ps (on the node where the mpirun is running) and it should show you your job. This will tell you if the job is started and still running or if there was a problem launching. If the program did start and has really hung then you can get more in-depth information about it using padb which is linked to in my signature.

FWIW: we use padb for this purpose, and it is very helpful!

Ralph

Ashley,

--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] torque pbs behaviour...
On Tue, 2009-08-11 at 03:03 -0600, Ralph Castain wrote:
> If it isn't already there, try putting a print statement right at
> program start, another just prior to MPI_Init, and another just after
> MPI_Init. It could be that something is hanging somewhere during
> program startup since it sounds like everything is launching just
> fine.

If you suspect a hang then you can use the command orte-ps (on the node where the mpirun is running) and it should show you your job. This will tell you if the job is started and still running or if there was a problem launching. If the program did start and has really hung then you can get more in-depth information about it using padb which is linked to in my signature.

Ashley,

--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
Re: [OMPI users] torque pbs behaviour...
On 10-Aug-09, at 8:03 PM, Ralph Castain wrote:

Interesting! Well, I always make sure I have my personal OMPI build before any system stuff, and I work exclusively on Mac OS-X: Note that I always configure with --prefix=somewhere-in-my-own-dir, never to a system directory. Avoids this kind of confusion.

Yeah, I did configure --prefix=/usr/local/openmpi

What the errors are saying is that we are picking up components from a very old version of OMPI that is distributed by Apple. It may or may not be causing confusion for the system - hard to tell. However, the fact that it is the IO forwarding subsystem that is picking them up, and the fact that you aren't seeing any output from your job, makes me a tad suspicious.

Me too!

Can you run other jobs? In other words, do you get stdout/stderr from other programs you run, or does every MPI program hang (even simple ones)? If it is just your program, then it could just be that your application is hanging before any output is generated. Can you have it print something to stderr right when it starts?

No, simple ones, like the examples I gave before, run fine, just with the suspicious warnings. I'm running a big general circulation model (MITgcm). Under normal conditions it spits something out almost right away, and that is not being done here. STDOUT.0001 etc. are all opened, but nothing is put into them.
I'm pretty sure I'm compiling the gcm properly:

otool -L mitgcmuv
mitgcmuv:
  /usr/local/openmpi/lib/libmpi_f77.0.dylib (compatibility version 1.0.0, current version 1.0.0)
  /usr/local/openmpi/lib/libmpi.0.dylib (compatibility version 1.0.0, current version 1.0.0)
  /usr/local/openmpi/lib/libopen-rte.0.dylib (compatibility version 1.0.0, current version 1.0.0)
  /usr/local/openmpi/lib/libopen-pal.0.dylib (compatibility version 1.0.0, current version 1.0.0)
  /usr/lib/libutil.dylib (compatibility version 1.0.0, current version 1.0.0)
  /usr/local/lib/libgfortran.3.dylib (compatibility version 4.0.0, current version 4.0.0)
  /usr/local/lib/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)
  /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 111.1.3)

Thanks, Jody

On Aug 10, 2009, at 8:53 PM, Klymak Jody wrote:

On 10-Aug-09, at 6:44 PM, Ralph Castain wrote: Check your LD_LIBRARY_PATH - there is an earlier version of OMPI in your path that is interfering with operation (i.e., it comes before your 1.3.3 installation).

Hmm, the OS X faq says not to do this: "Note that there is no need to add Open MPI's libdir to LD_LIBRARY_PATH; Open MPI's shared library build process automatically uses the "rpath" mechanism to automatically find the correct shared libraries (i.e., the ones associated with this build, vs., for example, the OS X-shipped OMPI shared libraries).
http://www.open-mpi.org/faq/?category=osx

Regardless, if I set either, and run ompi_info I still get:

[saturna.cluster:94981] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:94981] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored

echo $DYLD_LIBRARY_PATH $LD_LIBRARY_PATH
/usr/local/openmpi/lib: /usr/local/openmpi/lib:

So I'm afraid I'm stumped again. I suppose I could go clean out all the libraries in /usr/lib/... Thanks again, sorry to be a pain... Cheers, Jody

On Aug 10, 2009, at 7:38 PM, Klymak Jody wrote:

So, mpirun --display-allocation -pernode --display-map hostname gives me the output below. Simple jobs seem to run, but the MITgcm does not, either under ssh or torque. It hangs at some early point in execution before anything is written, so it's hard for me to tell what the error is. Could these MCA warnings have anything to do with it? I've recompiled the gcm with -L /usr/local/openmpi/lib, so hopefully that catches the right library. Thanks, Jody

[xserve02.local:38126] mca: base: component_find: ras "mca_ras_dash_host" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_hostfile" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_localhost" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_xgrid" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: iof "mca_iof_proxy"
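[Editor's note: the cleanup suggested earlier in the thread - using "find" or "locate" to hunt down stale MCA plugins from an old Open MPI install - can be scripted. The sketch below is not from the thread; the function name is my own, and the component names come from the warning messages above. On a real cluster you would point it at each node's install roots (e.g. via ssh), not a temp directory.]

```shell
# Search an install root for MCA components left over from an old
# Open MPI (e.g. the Apple-shipped one that emits the v1.0.0 warnings).
find_stale_components() {
    root="$1"
    # Stale plugins show up as mca_*.la / mca_*.so / mca_*.dylib files,
    # e.g. mca_iof_proxy.la from a pre-1.3 install.
    find "$root" -name 'mca_iof_proxy*' -o -name 'mca_iof_svc*' 2>/dev/null
}

# Demo against a throwaway directory tree standing in for /usr/local:
demo=$(mktemp -d)
mkdir -p "$demo/openmpi/lib/openmpi"
touch "$demo/openmpi/lib/openmpi/mca_iof_proxy.la"   # fake stale plugin
find_stale_components "$demo"    # prints the path of the fake .la file
rm -rf "$demo"
```

Anything the search turns up outside the current 1.3.3 prefix is a candidate for deletion before reinstalling.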
Re: [OMPI users] torque pbs behaviour...
Interesting! Well, I always make sure I have my personal OMPI build before any system stuff, and I work exclusively on Mac OS-X:

rhc$ echo $PATH
/Library/Frameworks/Python.framework/Versions/Current/bin:/Users/rhc/openmpi/bin:/Users/rhc/bin:/opt/local/bin:/usr/X11R6/bin:/usr/local/bin:/opt/local/bin:/opt/local/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/texbin

rhc$ echo $LD_LIBRARY_PATH
/Users/rhc/openmpi/lib:/Users/rhc/lib:/opt/local/lib:/usr/X11R6/lib:/usr/local/lib:

Note that I always configure with --prefix=somewhere-in-my-own-dir, never to a system directory. Avoids this kind of confusion.

What the errors are saying is that we are picking up components from a very old version of OMPI that is distributed by Apple. It may or may not be causing confusion for the system - hard to tell. However, the fact that it is the IO forwarding subsystem that is picking them up, and the fact that you aren't seeing any output from your job, makes me a tad suspicious.

Can you run other jobs? In other words, do you get stdout/stderr from other programs you run, or does every MPI program hang (even simple ones)? If it is just your program, then it could just be that your application is hanging before any output is generated. Can you have it print something to stderr right when it starts?

On Aug 10, 2009, at 8:53 PM, Klymak Jody wrote:

On 10-Aug-09, at 6:44 PM, Ralph Castain wrote: Check your LD_LIBRARY_PATH - there is an earlier version of OMPI in your path that is interfering with operation (i.e., it comes before your 1.3.3 installation).

Hmm, the OS X faq says not to do this: "Note that there is no need to add Open MPI's libdir to LD_LIBRARY_PATH; Open MPI's shared library build process automatically uses the "rpath" mechanism to automatically find the correct shared libraries (i.e., the ones associated with this build, vs., for example, the OS X-shipped OMPI shared libraries).
Also note that we specifically do not recommend adding Open MPI's libdir to DYLD_LIBRARY_PATH."

http://www.open-mpi.org/faq/?category=osx

Regardless, if I set either, and run ompi_info I still get:

[saturna.cluster:94981] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:94981] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored

echo $DYLD_LIBRARY_PATH $LD_LIBRARY_PATH
/usr/local/openmpi/lib: /usr/local/openmpi/lib:

So I'm afraid I'm stumped again. I suppose I could go clean out all the libraries in /usr/lib/... Thanks again, sorry to be a pain... Cheers, Jody

On Aug 10, 2009, at 7:38 PM, Klymak Jody wrote:

So, mpirun --display-allocation -pernode --display-map hostname gives me the output below. Simple jobs seem to run, but the MITgcm does not, either under ssh or torque. It hangs at some early point in execution before anything is written, so it's hard for me to tell what the error is. Could these MCA warnings have anything to do with it? I've recompiled the gcm with -L /usr/local/openmpi/lib, so hopefully that catches the right library.
Thanks, Jody

[xserve02.local:38126] mca: base: component_find: ras "mca_ras_dash_host" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_hostfile" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_localhost" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_xgrid" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored

== ALLOCATED NODES ==
Data for node: Name: xserve02.local  Num slots: 8  Max slots: 0
Data for node: Name: xserve01.local  Num slots: 8  Max slots: 0

== JOB MAP ==
Data for node: Name: xserve02.local  Num procs: 1
  Process OMPI jobid: [20967,1] Process rank: 0
Data for node: Name: xserve01.local  Num procs: 1
  Process OMPI jobid: [20967,1] Process rank: 1
Re: [OMPI users] torque pbs behaviour...
On 10-Aug-09, at 6:44 PM, Ralph Castain wrote: Check your LD_LIBRARY_PATH - there is an earlier version of OMPI in your path that is interfering with operation (i.e., it comes before your 1.3.3 installation). H, The OS X faq says not to do this: "Note that there is no need to add Open MPI's libdir to LD_LIBRARY_PATH; Open MPI's shared library build process automatically uses the "rpath" mechanism to automatically find the correct shared libraries (i.e., the ones associated with this build, vs., for example, the OS X-shipped OMPI shared libraries). Also note that we specifically do not recommend adding Open MPI's libdir to DYLD_LIBRARY_PATH." http://www.open-mpi.org/faq/?category=osx Regardless, if I set either, and run ompi_info I still get: [saturna.cluster:94981] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored [saturna.cluster:94981] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored echo $DYLD_LIBRARY_PATH $LD_LIBRARY_PATH /usr/local/openmpi/lib: /usr/local/openmpi/lib: So I'm afraid I'm stumped again. I suppose I could go clean out all the libraries in /usr/lib/... Thanks again, sorry to be a pain... Cheers, Jody On Aug 10, 2009, at 7:38 PM, Klymak Jody wrote: So, mpirun --display-allocation -pernode --display-map hostname gives me the output below. Simple jobs seem to run, but the MITgcm does not, either under ssh or torque. It hangs at some early point in execution before anything is written, so its hard for me to tell what the error is. Could these MCA warnings have anything to do with it? I've recompiled the gcm with -L /usr/local/openmpi/lib, so hopefully that catches the right library. 
Thanks, Jody

[xserve02.local:38126] mca: base: component_find: ras "mca_ras_dash_host" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_hostfile" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_localhost" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_xgrid" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored

== ALLOCATED NODES ==
Data for node: Name: xserve02.local  Num slots: 8  Max slots: 0
Data for node: Name: xserve01.local  Num slots: 8  Max slots: 0

== JOB MAP ==
Data for node: Name: xserve02.local  Num procs: 1
  Process OMPI jobid: [20967,1] Process rank: 0
Data for node: Name: xserve01.local  Num procs: 1
  Process OMPI jobid: [20967,1] Process rank: 1

[xserve01.cluster:38518] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve01.cluster:38518] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored

xserve02.local
xserve01.cluster
Re: [OMPI users] torque pbs behaviour...
So, mpirun --display-allocation -pernode --display-map hostname gives me the output below. Simple jobs seem to run, but the MITgcm does not, either under ssh or torque. It hangs at some early point in execution before anything is written, so it's hard for me to tell what the error is. Could these MCA warnings have anything to do with it? I've recompiled the gcm with -L /usr/local/openmpi/lib, so hopefully that catches the right library.

Thanks, Jody

[xserve02.local:38126] mca: base: component_find: ras "mca_ras_dash_host" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_hostfile" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_localhost" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: ras "mca_ras_xgrid" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve02.local:38126] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored

== ALLOCATED NODES ==
Data for node: Name: xserve02.local  Num slots: 8  Max slots: 0
Data for node: Name: xserve01.local  Num slots: 8  Max slots: 0

== JOB MAP ==
Data for node: Name: xserve02.local  Num procs: 1
  Process OMPI jobid: [20967,1] Process rank: 0
Data for node: Name: xserve01.local  Num procs: 1
  Process OMPI jobid: [20967,1] Process rank: 1

[xserve01.cluster:38518] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported
MCA v2.0.0) -- ignored
[xserve01.cluster:38518] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored

xserve02.local
xserve01.cluster
Re: [OMPI users] torque pbs behaviour...
No problem - actually, that default works with any environment, not just Torque On Aug 10, 2009, at 4:37 PM, Gus Correa wrote: Thank you for the correction, Ralph. I didn't know there was a (wise) default for the number of processes when using Torque-enabled OpenMPI. Gus Correa Ralph Castain wrote: Just to correct something said here. You need to tell mpirun how many processes to launch, regardless of whether you are using Torque or not. This is not correct. If you don't tell mpirun how many processes to launch, we will automatically launch one process for every slot in your allocation. In the case described here, there were 16 slots allocated, so we would automatically launch 16 processes. Ralph On Aug 10, 2009, at 3:47 PM, Gus Correa wrote: Hi Jody, list See comments inline. Jody Klymak wrote: On Aug 10, 2009, at 13:01 PM, Gus Correa wrote: Hi Jody We don't have Mac OS-X, but Linux, not sure if this applies to you. Did you configure your OpenMPI with Torque support, and pointed to the same library that provides the Torque you are using (--with-tm=/path/to/torque-library- directory)? Not explicitly. I'll check into that 1) If you don't do it explicitly, configure will use the first libtorque it finds (and that works I presume), which may/may not be the one you want, if you have more than one. If you only have one version of Torque installed, this shouldn't be the problem. 2) Have you tried something very simple, like the examples/hello_c.c program, to test the Torque-OpenMPI integration? 3) Also, just in case, put a "cat $PBS_NODEFILE" inside your script, before mpirun, to see what it reports. For "#PBS -l nodes=2:ppn=8" it should show 16 lines, 8 with the name of each node. 4) Finally, just to make sure the syntax is right. On your message you wrote: >>> If I submit openMPI with: >>> #PBS -l nodes=2:ppn=8 >>> mpirun MyProg Is this the real syntax you used? 
Or was it perhaps:

#PBS -l nodes=2:ppn=8
mpirun -n 16 MyProg

You need to tell mpirun how many processes to launch, regardless of whether you are using Torque or not.

My $0.02
Gus Correa

-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Are you using the right mpirun? (There are so many out there.)

yeah - I use the explicit path and moved the OS X one. Thanks! Jody

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Jody Klymak wrote:

Hi All, I've been trying to get torque pbs to work on my OS X 10.5.7 cluster with openMPI (after finding that Xgrid was pretty flaky about connections). I *think* this is an MPI problem (perhaps via operator error!) If I submit openMPI with:

#PBS -l nodes=2:ppn=8
mpirun MyProg

pbs locks off two of the processors, checked via "pbsnodes -a", and the job output. But mpirun runs the whole job on the second of the two processors. If I run the same job w/o qsub (i.e. using ssh)

mpirun -n 16 -host xserve01,xserve02 MyProg

it runs fine on all the nodes. My /var/spool/toque/server_priv/nodes file looks like:

xserve01.local np=8
xserve02.local np=8

Any idea what could be going wrong or how to debug this properly? There is nothing suspicious in the server or mom logs. Thanks for any help, Jody

--
Jody Klymak
http://web.uvic.ca/~jklymak/
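[Editor's note: Ralph's point above - that a bare "mpirun" under a Torque-aware Open MPI launches one process per allocated slot - follows from the shape of $PBS_NODEFILE, which lists each slot on its own line. A small sketch, using hypothetical node names from this thread:]

```shell
# With "#PBS -l nodes=2:ppn=8", Torque writes one line per slot to
# $PBS_NODEFILE; a bare "mpirun MyProg" then launches one process per
# line. Simulate that nodefile and count the slots.
nodefile=$(mktemp)
for node in xserve01.local xserve02.local; do
    i=0
    while [ $i -lt 8 ]; do
        echo "$node" >> "$nodefile"
        i=$((i + 1))
    done
done
slots=$(( $(wc -l < "$nodefile") ))
echo "mpirun would launch $slots processes"   # -> mpirun would launch 16 processes
rm -f "$nodefile"
```

So "mpirun MyProg" and "mpirun -n 16 MyProg" are equivalent here; the -n only matters when you want fewer (or more) processes than slots.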
Re: [OMPI users] torque pbs behaviour...
Thank you for the correction, Ralph. I didn't know there was a (wise) default for the number of processes when using Torque-enabled OpenMPI. Gus Correa Ralph Castain wrote: Just to correct something said here. You need to tell mpirun how many processes to launch, regardless of whether you are using Torque or not. This is not correct. If you don't tell mpirun how many processes to launch, we will automatically launch one process for every slot in your allocation. In the case described here, there were 16 slots allocated, so we would automatically launch 16 processes. Ralph On Aug 10, 2009, at 3:47 PM, Gus Correa wrote: Hi Jody, list See comments inline. Jody Klymak wrote: On Aug 10, 2009, at 13:01 PM, Gus Correa wrote: Hi Jody We don't have Mac OS-X, but Linux, not sure if this applies to you. Did you configure your OpenMPI with Torque support, and pointed to the same library that provides the Torque you are using (--with-tm=/path/to/torque-library-directory)? Not explicitly. I'll check into that 1) If you don't do it explicitly, configure will use the first libtorque it finds (and that works I presume), which may/may not be the one you want, if you have more than one. If you only have one version of Torque installed, this shouldn't be the problem. 2) Have you tried something very simple, like the examples/hello_c.c program, to test the Torque-OpenMPI integration? 3) Also, just in case, put a "cat $PBS_NODEFILE" inside your script, before mpirun, to see what it reports. For "#PBS -l nodes=2:ppn=8" it should show 16 lines, 8 with the name of each node. 4) Finally, just to make sure the syntax is right. On your message you wrote: >>> If I submit openMPI with: >>> #PBS -l nodes=2:ppn=8 >>> mpirun MyProg Is this the real syntax you used? Or was it perhaps: #PBS -l nodes=2:ppn=8 mpirun -n 16 MyProg You need to tell mpirun how many processes to launch, regardless of whether you are using Torque or not. 
My $0.02
Gus Correa

-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Are you using the right mpirun? (There are so many out there.)

yeah - I use the explicit path and moved the OS X one. Thanks! Jody

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Jody Klymak wrote:

Hi All, I've been trying to get torque pbs to work on my OS X 10.5.7 cluster with openMPI (after finding that Xgrid was pretty flaky about connections). I *think* this is an MPI problem (perhaps via operator error!) If I submit openMPI with:

#PBS -l nodes=2:ppn=8
mpirun MyProg

pbs locks off two of the processors, checked via "pbsnodes -a", and the job output. But mpirun runs the whole job on the second of the two processors. If I run the same job w/o qsub (i.e. using ssh)

mpirun -n 16 -host xserve01,xserve02 MyProg

it runs fine on all the nodes. My /var/spool/toque/server_priv/nodes file looks like:

xserve01.local np=8
xserve02.local np=8

Any idea what could be going wrong or how to debug this properly? There is nothing suspicious in the server or mom logs. Thanks for any help, Jody

--
Jody Klymak
http://web.uvic.ca/~jklymak/
Re: [OMPI users] torque pbs behaviour...
No problem - yes indeed, 1.1.x would be a bad choice :-)

On Aug 10, 2009, at 3:58 PM, Jody Klymak wrote:

On Aug 10, 2009, at 14:39 PM, Ralph Castain wrote: mpirun --display-allocation -pernode --display-map hostname

Ummm, hmm, this is embarrassing, none of those command line arguments worked, making me suspicious... It looks like somehow I decided to build and run openMPI 1.1.5, or at least mpirun --version tells me that is the mpirun version. I'll get back to you when I get time to rebuild with 1.3.3. Could be that this is the source of my xgrid problems as well. Sorry for the noise. I'll get back to you if I still have problems... Thanks, Jody

--
Jody Klymak
http://web.uvic.ca/~jklymak/
Re: [OMPI users] torque pbs behaviour...
On Aug 10, 2009, at 14:39 PM, Ralph Castain wrote: mpirun --display-allocation -pernode --display-map hostname

Ummm, hmm, this is embarrassing, none of those command line arguments worked, making me suspicious... It looks like somehow I decided to build and run openMPI 1.1.5, or at least mpirun --version tells me that is the mpirun version. I'll get back to you when I get time to rebuild with 1.3.3. Could be that this is the source of my xgrid problems as well. Sorry for the noise. I'll get back to you if I still have problems... Thanks, Jody

--
Jody Klymak
http://web.uvic.ca/~jklymak/
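[Editor's note: the 1.1.5 surprise above is a PATH-ordering problem - an old mpirun shadowing the new install. A hedged helper for spotting this; the function name is my own, not from the thread, and the demo uses "sh" simply because it exists on any system:]

```shell
# List every copy of a command on PATH, in lookup order. The first hit
# is what a bare command name resolves to - if it is not the 1.3.3
# install's mpirun, that explains silently running 1.1.5.
path_hits() {
    cmd="$1"
    old_ifs="$IFS"; IFS=:
    for dir in $PATH; do
        [ -x "$dir/$cmd" ] && echo "$dir/$cmd"
    done
    IFS="$old_ifs"
    return 0
}

path_hits sh    # demo; for this thread you would run: path_hits mpirun
```

Following up with "mpirun --version" (or the absolute /usr/local/openmpi/bin/mpirun) confirms which build actually runs.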
Re: [OMPI users] torque pbs behaviour...
Hi Jody, list See comments inline. Jody Klymak wrote: On Aug 10, 2009, at 13:01 PM, Gus Correa wrote: Hi Jody We don't have Mac OS-X, but Linux, not sure if this applies to you. Did you configure your OpenMPI with Torque support, and pointed to the same library that provides the Torque you are using (--with-tm=/path/to/torque-library-directory)? Not explicitly. I'll check into that 1) If you don't do it explicitly, configure will use the first libtorque it finds (and that works I presume), which may/may not be the one you want, if you have more than one. If you only have one version of Torque installed, this shouldn't be the problem. 2) Have you tried something very simple, like the examples/hello_c.c program, to test the Torque-OpenMPI integration? 3) Also, just in case, put a "cat $PBS_NODEFILE" inside your script, before mpirun, to see what it reports. For "#PBS -l nodes=2:ppn=8" it should show 16 lines, 8 with the name of each node. 4) Finally, just to make sure the syntax is right. On your message you wrote: >>> If I submit openMPI with: >>> #PBS -l nodes=2:ppn=8 >>> mpirun MyProg Is this the real syntax you used? Or was it perhaps: #PBS -l nodes=2:ppn=8 mpirun -n 16 MyProg You need to tell mpirun how many processes to launch, regardless of whether you are using Torque or not. My $0.02 Gus Correa - Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA - Are you using the right mpirun? (There are so many out there.) yeah - I use the explicit path and moved the OS X one. Thanks! Jody Gus Correa - Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA - Jody Klymak wrote: Hi All, I've been trying to get torque pbs to work on my OS X 10.5.7 cluster with openMPI (after finding that Xgrid was pretty flaky about connections). I *think* this is an MPI problem (perhaps via operator error!) 
If I submit openMPI with:

#PBS -l nodes=2:ppn=8
mpirun MyProg

pbs locks off two of the processors, checked via "pbsnodes -a", and the job output. But mpirun runs the whole job on the second of the two processors. If I run the same job w/o qsub (i.e. using ssh)

mpirun -n 16 -host xserve01,xserve02 MyProg

it runs fine on all the nodes. My /var/spool/toque/server_priv/nodes file looks like:

xserve01.local np=8
xserve02.local np=8

Any idea what could be going wrong or how to debug this properly? There is nothing suspicious in the server or mom logs. Thanks for any help, Jody

--
Jody Klymak
http://web.uvic.ca/~jklymak/
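[Editor's note: pulling the thread's pieces together, a minimal job script for this setup might look like the sketch below. The file name run.pbs is hypothetical; the script is only written to a file here, not submitted - on the cluster you would "qsub run.pbs". It assumes an Open MPI built with Torque support (--with-tm), so the bare mpirun reads the allocation itself.]

```shell
# Write a minimal Torque job script matching the thread's 2x8 setup.
cat > run.pbs <<'EOF'
#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -j oe
cd "$PBS_O_WORKDIR"
# Sanity check (per Gus's suggestion): should print 16 lines, 8 per node.
cat "$PBS_NODEFILE"
# With tm support, no -n or -host: one process per allocated slot.
mpirun ./MyProg
EOF
grep '^#PBS' run.pbs    # show the two directives
```

Putting the cat $PBS_NODEFILE line before mpirun makes the "all 16 processes on one node" symptom easy to distinguish from a bad allocation.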
Re: [OMPI users] torque pbs behaviour...
On Aug 10, 2009, at 3:25 PM, Jody Klymak wrote: Hi Ralph, On Aug 10, 2009, at 13:04 PM, Ralph Castain wrote: Umm...are you saying that your $PBS_NODEFILE contains the following: No, if I put cat $PBS_NODEFILE in the pbs script I get xserve02.local ... xserve02.local xserve01.local ... xserve01.local each repeated 8 times. So that seems to be working Good! xserve01.local np=8 xserve02.local np=8 If so, that could be part of the problem - it isn't the standard notation we are expecting to see in that file. What Torque normally provides is one line for each slot, so we would expect to see "xserve01.local" repeated 8 times, followed by "xserve02.local" repeated 8 times. Given the different syntax, we may not be parsing the file correctly. How was this file created? The file I am referring to above is the $TORQUEHOME/server_priv/ nodes file, that I created it by hand based on my understanding of the docs at: http://www.clusterresources.com/torquedocs/nodeconfig.shtml OMPI doesn't care about that file - only Torque looks at it. Also, could you clarify what node mpirun is executing on? mpirun seems to only run on xserve02 The job I'm running is just creating a file: #!/bin/bash H=`hostname` echo $H sleep 10 uptime >& $H.txt In the stdout, the echo $H returns "xserve02.local" 16 times and only xsever02.local.txt gets created... Again, if I run with "ssh" outside of pbs I get the expected response. Try running: mpirun --display-allocation -pernode --display-map hostname This will tell us what OMPI is seeing in terms of the nodes available to it. Based on what you show above, it should see both of your nodes. By forcing OMPI to put one proc/node, you'll be directing it to use both nodes for the job. You should see this in the reported map. 
If we then see both procs run on the same node, I would suggest reconfiguring OMPI with --enable-debug, and then rerunning the above command with an additional flag:

-mca plm_base_verbose 5

which will show us all the ugly details of what OMPI is telling Torque to do. Since OMPI works fine with Torque on Linux, my guess is that there is something about the Torque for Mac that is a little different and thus causing problems.

Ralph

Thanks, Jody

Ralph

On Aug 10, 2009, at 1:43 PM, Jody Klymak wrote:

Hi All, I've been trying to get torque pbs to work on my OS X 10.5.7 cluster with openMPI (after finding that Xgrid was pretty flaky about connections). I *think* this is an MPI problem (perhaps via operator error!) If I submit openMPI with:

#PBS -l nodes=2:ppn=8
mpirun MyProg

pbs locks off two of the processors, checked via "pbsnodes -a", and the job output. But mpirun runs the whole job on the second of the two processors. If I run the same job w/o qsub (i.e. using ssh)

mpirun -n 16 -host xserve01,xserve02 MyProg

it runs fine on all the nodes. My /var/spool/toque/server_priv/nodes file looks like:

xserve01.local np=8
xserve02.local np=8

Any idea what could be going wrong or how to debug this properly? There is nothing suspicious in the server or mom logs. Thanks for any help, Jody

--
Jody Klymak
http://web.uvic.ca/~jklymak/
Re: [OMPI users] torque pbs behaviour...
Hi Ralph,

On Aug 10, 2009, at 1:04 PM, Ralph Castain wrote:

> Umm...are you saying that your $PBS_NODEFILE contains the following:

No, if I put "cat $PBS_NODEFILE" in the pbs script I get

  xserve02.local
  ...
  xserve02.local
  xserve01.local
  ...
  xserve01.local

each repeated 8 times. So that seems to be working.

>   xserve01.local np=8
>   xserve02.local np=8
>
> If so, that could be part of the problem - it isn't the standard
> notation we are expecting to see in that file. What Torque normally
> provides is one line for each slot, so we would expect to see
> "xserve01.local" repeated 8 times, followed by "xserve02.local"
> repeated 8 times. Given the different syntax, we may not be parsing
> the file correctly. How was this file created?

The file I am referring to above is the $TORQUEHOME/server_priv/nodes file, which I created by hand based on my understanding of the docs at:
http://www.clusterresources.com/torquedocs/nodeconfig.shtml

> Also, could you clarify what node mpirun is executing on?

mpirun seems to only run on xserve02. The job I'm running just creates a file:

  #!/bin/bash
  H=`hostname`
  echo $H
  sleep 10
  uptime >& $H.txt

In the stdout, the echo $H returns "xserve02.local" 16 times and only xserve02.local.txt gets created... Again, if I run with "ssh" outside of pbs I get the expected response.

Thanks,
Jody

> Ralph

--
Jody Klymak
http://web.uvic.ca/~jklymak/
Re: [OMPI users] torque pbs behaviour...
On Aug 10, 2009, at 1:01 PM, Gus Correa wrote:

> Hi Jody
>
> We don't have Mac OS-X, but Linux, so I'm not sure if this applies to
> you. Did you configure your OpenMPI with Torque support, and point it
> to the same library that provides the Torque you are using
> (--with-tm=/path/to/torque-library-directory)?

Not explicitly. I'll check into that.

> Are you using the right mpirun? (There are so many out there.)

Yeah - I use the explicit path and moved the OS X one.

Thanks!
Jody

> Gus Correa
> ---------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------

--
Jody Klymak
http://web.uvic.ca/~jklymak/
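Gus's suggestion can be sketched concretely. Assuming Torque is installed under /usr/local/torque (an assumption — substitute the prefix that actually contains Torque's include/ and lib/ directories), rebuilding Open MPI with explicit tm support would look something like:

```shell
# Rebuild Open MPI with Torque (tm) support so mpirun can read the
# PBS allocation natively. The --with-tm path below is an assumption;
# point it at your actual Torque install prefix.
./configure --prefix=/usr/local/openmpi \
            --with-tm=/usr/local/torque
make all install

# Afterwards, confirm that the tm components were actually built:
ompi_info | grep -i tm
```

If the tm plm/ras components do not appear in the ompi_info output, Open MPI was built without Torque support and will fall back to rsh/ssh launching, which matches the "Skipping component [tm]" lines seen later in this thread.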
Re: [OMPI users] torque pbs behaviour...
Umm...are you saying that your $PBS_NODEFILE contains the following:

  xserve01.local np=8
  xserve02.local np=8

If so, that could be part of the problem - it isn't the standard notation we are expecting to see in that file. What Torque normally provides is one line for each slot, so we would expect to see "xserve01.local" repeated 8 times, followed by "xserve02.local" repeated 8 times. Given the different syntax, we may not be parsing the file correctly. How was this file created?

Also, could you clarify what node mpirun is executing on?

Ralph
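Pulling the pieces of this thread together, a submission script exercising Ralph's diagnostics would look along these lines (a sketch only — MyProg and the /usr/local/openmpi prefix are the poster's, and the explicit mpirun path follows Gus's advice about picking up the right binary):

```shell
#!/bin/bash
#PBS -l nodes=2:ppn=8

# Show the allocation Torque handed us: one line per slot, i.e.
# each hostname repeated ppn times.
cat $PBS_NODEFILE

# Use the explicit mpirun path to avoid the stock OS X MPI, and ask
# Open MPI to report the allocation and the process map it computed.
/usr/local/openmpi/bin/mpirun --display-allocation -pernode \
    --display-map hostname
```

If the displayed map places both procs on one node even though the allocation shows both hosts, that isolates the problem to Open MPI's reading of the Torque allocation rather than to Torque itself.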