Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)
> Well, does `mpiexec` point to the correct one?

I don't really get this. I installed one and only one Open MPI on the node, so there shouldn't be another `mpiexec` on the system. It's worth mentioning that every node is deployed from a master image, so everything is exactly the same except the IP and DNS name.

> I thought you compiled it on your own with --with-sge. What about:

pwbcad@sgeqexec01:~$ ompi_info | grep grid
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.1)

Is there any location where I can find a more meaningful Open MPI log? I will try to install Open MPI 1.4.3 and see if that works.

I want to confirm one more thing: does SGE's master host need to have Open MPI installed? Is it relevant?

Many thanks, Reuti.

Derrick
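For reference, a quick way to confirm that every execution host reports the same Open MPI build and the same gridengine component is to run the same probe on every node. This is only a sketch: the hostnames are placeholders, and it assumes passwordless ssh between the machines.

#!/bin/sh
# Hypothetical consistency check: every node should print the same
# mpiexec path, Open MPI version, and gridengine RAS component line.
for host in sgeqexec01 sgeqexec02; do
    echo "== $host =="
    ssh "$host" 'which mpiexec; mpiexec --version 2>&1 | head -n 1; ompi_info | grep gridengine'
done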
Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)
On 16.04.2011 at 23:09, Derrick LIN wrote:

> > So you route the SGE startup mechanism to use `ssh`; nevertheless it should work, of course. A small difference from a conventional `ssh` is that SGE will start a private daemon for each job on the nodes, listening on a random port.
> >
> > When you use only one host, forks will be created but no `ssh` call. Does your test use more than one node?
>
> I have tested with more than one node, but the error still happened.
>
> > Did you copy your SGE-aware version to all nodes at the same location? Are you getting the correct `mpiexec` and shared libraries in your jobscript? Does the output of:
>
> I installed it from the Ubuntu apt-get on each node, so Open MPI is in the standard location. In fact Ubuntu handles all the dependencies very well, without my worrying about PATH or LD_LIBRARY_PATH.

Well, does `mpiexec` point to the correct one? I thought you had compiled it on your own with --with-sge. What about:

$ ompi_info | grep grid
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.3)

You have this on all nodes, and your binary was compiled with this version? Everything below looks fine. You can even try to start "from scratch" with a private copy of Open MPI which you install, for example, in $HOME/local/openmpi-1.4.3, setting the paths accordingly.

-- Reuti

> > #!/bin/sh
> > which mpiexec
> > echo $LD_LIBRARY_PATH
> > ldd ompi_job
> >
> > show the expected ones (ompi_job is the binary and ompi_job.sh the script) when submitted with a PE request?
>
> /usr/bin/mpiexec
> /usr/lib/openmpi/lib:/usr/lib/openmpi/lib/openmpi
> linux-vdso.so.1 => (0x7fff9b1ff000)
> libmpi.so.0 => /usr/lib/libmpi.so.0 (0x2af0868aa000)
> libopen-rte.so.0 => /usr/lib/libopen-rte.so.0 (0x2af086b58000)
> libopen-pal.so.0 => /usr/lib/libopen-pal.so.0 (0x2af086da4000)
> libdl.so.2 => /lib/libdl.so.2 (0x2af087017000)
> libnsl.so.1 => /lib/libnsl.so.1 (0x2af08721b000)
> libutil.so.1 => /lib/libutil.so.1 (0x2af087436000)
> libm.so.6 => /lib/libm.so.6 (0x2af087639000)
> libpthread.so.0 => /lib/libpthread.so.0 (0x2af0878bc000)
> libc.so.6 => /lib/libc.so.6 (0x2af087ada000)
> /lib64/ld-linux-x86-64.so.2 (0x2af086687000)
>
> Below is some runtime data from a job spooling directory on the execution host:
>
> pwbcad@sgeqexec01:128.1$ ls
> addgrpid  config  environment  error  exit_status  job_pid  pe_hostfile  pid  trace  usage
>
> pwbcad@sgeqexec01:128.1$ cat config
> add_grp_id=65416
> fs_stdin_host=""
> fs_stdin_path=
> fs_stdin_tmp_path=/tmp/128.1.dev.q/
> fs_stdin_file_staging=0
> fs_stdout_host=""
> fs_stdout_path=
> fs_stdout_tmp_path=/tmp/128.1.dev.q/
> fs_stdout_file_staging=0
> fs_stderr_host=""
> fs_stderr_path=
> fs_stderr_tmp_path=/tmp/128.1.dev.q/
> fs_stderr_file_staging=0
> stdout_path=/mnt/FacilityBioinformatics/pwbcad
> stderr_path=/mnt/FacilityBioinformatics/pwbcad
> stdin_path=/dev/null
> merge_stderr=1
> tmpdir=/tmp/128.1.dev.q
> handle_as_binary=0
> no_shell=0
> ckpt_job=0
> h_vmem=INFINITY
> h_vmem_is_consumable_job=0
> s_vmem=INFINITY
> s_vmem_is_consumable_job=0
> h_cpu=INFINITY
> h_cpu_is_consumable_job=0
> s_cpu=INFINITY
> s_cpu_is_consumable_job=0
> h_stack=INFINITY
> h_stack_is_consumable_job=0
> s_stack=INFINITY
> s_stack_is_consumable_job=0
> h_data=INFINITY
> h_data_is_consumable_job=0
> s_data=INFINITY
> s_data_is_consumable_job=0
> h_core=INFINITY
> s_core=INFINITY
> h_rss=INFINITY
> s_rss=INFINITY
> h_fsize=INFINITY
> s_fsize=INFINITY
> s_descriptors=UNDEFINED
> h_descriptors=UNDEFINED
> s_maxproc=UNDEFINED
> h_maxproc=UNDEFINED
> s_memorylocked=UNDEFINED
> h_memorylocked=UNDEFINED
> s_locks=UNDEFINED
> h_locks=UNDEFINED
> priority=0
> shell_path=/bin/bash
> script_file=/var/spool/gridengine/execd/sgeqexec01/job_scripts/128
> job_owner=pwbcad
> min_gid=0
> min_uid=0
> cwd=/mnt/FacilityBioinformatics/pwbcad
> prolog=none
> epilog=none
> starter_method=NONE
> suspend_method=NONE
> resume_method=NONE
> terminate_method=NONE
> script_timeout=120
> pe=orte
> pe_slots=16
> host_slots=8
> pe_hostfile=/var/spool/gridengine/execd/sgeqexec01/active_jobs/128.1/pe_hostfile
> pe_start=/bin/true
> pe_stop=/bin/true
> pe_stdout_path=/mnt/FacilityBioinformatics/pwbcad
> pe_stderr_path=/mnt/FacilityBioinformatics/pwbcad
> shell_start_mode=posix_compliant
> use_login_shell=1
> mail_list=pwb...@enzo.garvan.unsw.edu.au
> mail_options=0
> forbid_reschedule=0
> forbid_apperror=0
> queue=dev.q
> host=sgeqexec01.garvan.unsw.edu.au
> processors=UNDEFINED
> binding=NULL
> job_name=run_cal_pi_auto
> job_id=128
> ja_task_id=0
> account=sge
> submission_time=1302987873
> notify=0
> acct_project=none
> njob_args=0
> queue_tmpdir=/tmp
> use_afs=0
> admin_user=sgeadmin
> notify_kill_type=1
> notify_kill=default
> notify_susp_type=1
> notify_susp=default
> qsub_gid=no
> pty=0
> write_osjob_id=1
> inherit_env=1
> enable_windomacc=0
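For completeness, a rough sketch of that "from scratch" build. Only --prefix and --with-sge come from this thread; the tarball name and the plain make steps are assumptions about a default build.

# Build a private, SGE-aware copy of Open MPI under $HOME/local.
tar xjf openmpi-1.4.3.tar.bz2
cd openmpi-1.4.3
./configure --prefix=$HOME/local/openmpi-1.4.3 --with-sge
make && make install

# Point the job environment at this copy before calling mpiexec:
export PATH=$HOME/local/openmpi-1.4.3/bin:$PATH
export LD_LIBRARY_PATH=$HOME/local/openmpi-1.4.3/lib:$LD_LIBRARY_PATH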
Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)
> So you route the SGE startup mechanism to use `ssh`; nevertheless it should work, of course. A small difference from a conventional `ssh` is that SGE will start a private daemon for each job on the nodes, listening on a random port.
>
> When you use only one host, forks will be created but no `ssh` call. Does your test use more than one node?

I have tested with more than one node, but the error still happened.

> Did you copy your SGE-aware version to all nodes at the same location? Are you getting the correct `mpiexec` and shared libraries in your jobscript? Does the output of:

I installed it from the Ubuntu apt-get on each node, so Open MPI is in the standard location. In fact Ubuntu handles all the dependencies very well, without my worrying about PATH or LD_LIBRARY_PATH.

> #!/bin/sh
> which mpiexec
> echo $LD_LIBRARY_PATH
> ldd ompi_job
>
> show the expected ones (ompi_job is the binary and ompi_job.sh the script) when submitted with a PE request?

/usr/bin/mpiexec
/usr/lib/openmpi/lib:/usr/lib/openmpi/lib/openmpi
linux-vdso.so.1 => (0x7fff9b1ff000)
libmpi.so.0 => /usr/lib/libmpi.so.0 (0x2af0868aa000)
libopen-rte.so.0 => /usr/lib/libopen-rte.so.0 (0x2af086b58000)
libopen-pal.so.0 => /usr/lib/libopen-pal.so.0 (0x2af086da4000)
libdl.so.2 => /lib/libdl.so.2 (0x2af087017000)
libnsl.so.1 => /lib/libnsl.so.1 (0x2af08721b000)
libutil.so.1 => /lib/libutil.so.1 (0x2af087436000)
libm.so.6 => /lib/libm.so.6 (0x2af087639000)
libpthread.so.0 => /lib/libpthread.so.0 (0x2af0878bc000)
libc.so.6 => /lib/libc.so.6 (0x2af087ada000)
/lib64/ld-linux-x86-64.so.2 (0x2af086687000)

Below is some runtime data from a job spooling directory on the execution host:

pwbcad@sgeqexec01:128.1$ ls
addgrpid  config  environment  error  exit_status  job_pid  pe_hostfile  pid  trace  usage

pwbcad@sgeqexec01:128.1$ cat config
add_grp_id=65416
fs_stdin_host=""
fs_stdin_path=
fs_stdin_tmp_path=/tmp/128.1.dev.q/
fs_stdin_file_staging=0
fs_stdout_host=""
fs_stdout_path=
fs_stdout_tmp_path=/tmp/128.1.dev.q/
fs_stdout_file_staging=0
fs_stderr_host=""
fs_stderr_path=
fs_stderr_tmp_path=/tmp/128.1.dev.q/
fs_stderr_file_staging=0
stdout_path=/mnt/FacilityBioinformatics/pwbcad
stderr_path=/mnt/FacilityBioinformatics/pwbcad
stdin_path=/dev/null
merge_stderr=1
tmpdir=/tmp/128.1.dev.q
handle_as_binary=0
no_shell=0
ckpt_job=0
h_vmem=INFINITY
h_vmem_is_consumable_job=0
s_vmem=INFINITY
s_vmem_is_consumable_job=0
h_cpu=INFINITY
h_cpu_is_consumable_job=0
s_cpu=INFINITY
s_cpu_is_consumable_job=0
h_stack=INFINITY
h_stack_is_consumable_job=0
s_stack=INFINITY
s_stack_is_consumable_job=0
h_data=INFINITY
h_data_is_consumable_job=0
s_data=INFINITY
s_data_is_consumable_job=0
h_core=INFINITY
s_core=INFINITY
h_rss=INFINITY
s_rss=INFINITY
h_fsize=INFINITY
s_fsize=INFINITY
s_descriptors=UNDEFINED
h_descriptors=UNDEFINED
s_maxproc=UNDEFINED
h_maxproc=UNDEFINED
s_memorylocked=UNDEFINED
h_memorylocked=UNDEFINED
s_locks=UNDEFINED
h_locks=UNDEFINED
priority=0
shell_path=/bin/bash
script_file=/var/spool/gridengine/execd/sgeqexec01/job_scripts/128
job_owner=pwbcad
min_gid=0
min_uid=0
cwd=/mnt/FacilityBioinformatics/pwbcad
prolog=none
epilog=none
starter_method=NONE
suspend_method=NONE
resume_method=NONE
terminate_method=NONE
script_timeout=120
pe=orte
pe_slots=16
host_slots=8
pe_hostfile=/var/spool/gridengine/execd/sgeqexec01/active_jobs/128.1/pe_hostfile
pe_start=/bin/true
pe_stop=/bin/true
pe_stdout_path=/mnt/FacilityBioinformatics/pwbcad
pe_stderr_path=/mnt/FacilityBioinformatics/pwbcad
shell_start_mode=posix_compliant
use_login_shell=1
mail_list=pwb...@enzo.garvan.unsw.edu.au
mail_options=0
forbid_reschedule=0
forbid_apperror=0
queue=dev.q
host=sgeqexec01.garvan.unsw.edu.au
processors=UNDEFINED
binding=NULL
job_name=run_cal_pi_auto
job_id=128
ja_task_id=0
account=sge
submission_time=1302987873
notify=0
acct_project=none
njob_args=0
queue_tmpdir=/tmp
use_afs=0
admin_user=sgeadmin
notify_kill_type=1
notify_kill=default
notify_susp_type=1
notify_susp=default
qsub_gid=no
pty=0
write_osjob_id=1
inherit_env=1
enable_windomacc=0
enable_addgrp_kill=0
csp=0
ignore_fqdn=0
default_domain=none

pwbcad@sgeqexec01:128.1$ cat environment
USER=pwbcad
SSH_CLIENT=149.171.200.64 63056 22
MAIL=/var/mail/pwbcad
SHLVL=1
OLDPWD=/home/pwbcad
HOME=/home/pwbcad
SSH_TTY=/dev/pts/4
PAGER=less
PS1=\[\e[32;1m\]\u\[\e[0m\]@\[\e[35;1m\]\h\[\e[0m\]:\[\e[34;1m\]\W\[\e[0m\]\$
LOGNAME=pwbcad
_=/usr/bin/qsub
TERM=xterm
SGE_ROOT=/var/lib/gridengine
PATH=/tmp/128.1.dev.q:.:/home/pwbcad/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/meme/bin:/usr/local/eigenstrat:/usr/local/tophat/bin:/usr/local/cufflinks/bin:/usr/local/defuse/bin:/usr/local/bowtie/bin:/usr/local/cnvseq/bin:/usr/local/fastx_toolkit/bin:/usr/local/breakway/bin
SGE_CELL=default
LANG=en_AU.UTF-8
SHELL=/bin/bash
PWD=/mnt/FacilityBioinformatics/pwbcad
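For context, a job matching the spooled config above (pe=orte, pe_slots=16) would have been submitted along these lines. The exact qsub switches are an assumption; ompi_job.sh and ompi_job are the script and binary names used earlier in the thread.

# Request the 'orte' PE with 16 slots, as recorded in the config file.
qsub -pe orte 16 -cwd ompi_job.sh

# Inside the jobscript, with a tight SGE integration no hostfile is
# needed: mpiexec reads the allocation from the PE environment.
mpiexec -np $NSLOTS ./ompi_job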
Re: [OMPI users] missing symbols in Windows 1.5.3 binaries?
Shiqing,

I'm using Composer XE 2011 and Visual Studio 2008; VS2008 is doing the linking. I'll do a build of 1.5.3 myself and see how the symbols turn out.

Damien

On 16/04/2011 1:50 PM, Shiqing Fan wrote:
> Hi Damien,
>
> Which version of the Intel compiler do you use? The only difference between 1.5.3 and 1.5.2 that I can see is that 1.5.3 was built with Intel Fortran Composer XE 2011. That might be the reason for the problem.
>
> Shiqing
>
> On 4/16/2011 4:01 AM, Damien wrote:
>> Hiya,
>>
>> I just tested the 1.5.3 binaries and my link pass broke. Using 1.5.3 I get unresolved externals on things like _MPI_NULL_COPY_FN; on 1.5.2.2 it's fine. I did a dumpbin on libmpi.lib for both versions, and in 1.5.3 there are upper-case symbols for _OMPI_C_MPI_NULL_COPY_FN but not _MPI_NULL_COPY_FN. In the 1.5.2.2 libmpi.lib there are symbols for both.
>>
>> Damien
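For anyone who wants to repeat the symbol comparison, here is a sketch of the dumpbin check from a Visual Studio command prompt. Damien did not say which dumpbin option he used; /linkermember (which lists the public symbols of a library) is an assumption.

REM List the public symbols of each import library and filter for the
REM copy-function symbol under discussion.
dumpbin /linkermember libmpi.lib | findstr MPI_NULL_COPY_FN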
Re: [OMPI users] missing symbols in Windows 1.5.3 binaries?
Hi Damien,

Which version of the Intel compiler do you use? The only difference between 1.5.3 and 1.5.2 that I can see is that 1.5.3 was built with Intel Fortran Composer XE 2011. That might be the reason for the problem.

Shiqing

On 4/16/2011 4:01 AM, Damien wrote:
> Hiya,
>
> I just tested the 1.5.3 binaries and my link pass broke. Using 1.5.3 I get unresolved externals on things like _MPI_NULL_COPY_FN; on 1.5.2.2 it's fine. I did a dumpbin on libmpi.lib for both versions, and in 1.5.3 there are upper-case symbols for _OMPI_C_MPI_NULL_COPY_FN but not _MPI_NULL_COPY_FN. In the 1.5.2.2 libmpi.lib there are symbols for both.
>
> Damien

--
Shiqing Fan
High Performance Computing Center Stuttgart (HLRS)
Nobelstrasse 19, 70569 Stuttgart
Tel: ++49(0)711-685-87234
Fax: ++49(0)711-685-65832
http://www.hlrs.de/organization/people/shiqing-fan/
email: f...@hlrs.de
[OMPI users] Ofed v1.5.3?
Does Open MPI v1.5.3 support OFED v1.5.3.1?
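No answer appears in this digest, but as a starting point one can at least check locally which OFED release is installed and whether the Open MPI build includes InfiniBand support. A sketch, assuming both tools are on the PATH, not an authoritative compatibility statement:

# The first line of ofed_info names the installed OFED release.
ofed_info | head -n 1
# Does this Open MPI build contain the InfiniBand (openib) BTL?
ompi_info | grep openib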