Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)
On 17 Apr 2011, at 01:21, Derrick LIN wrote:

>> Well, does `mpiexec` point to the correct one?
>
> I don't really get this. I only installed one and only one Open MPI on the node. There shouldn't be another `mpiexec` on the system.

It could be one from another MPI implementation, by accident.

> It's worth mentioning that every node is deployed from a master image, so everything is exactly the same except the IP and DNS name.
>
>> I thought you compiled it on your own with --with-sge. What about:
>
> pwbcad@sgeqexec01:~$ ompi_info | grep grid
>                  MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.1)

Fine.

> Is there any location where I can find a more meaningful Open MPI log?

Can you run a simple `mpiexec hostname` in the script?

> I will try to install Open MPI 1.4.3 and see if that works.
>
> I want to confirm one more thing: does SGE's master host need to have Open MPI installed? Is it relevant?

In principle: no. But often it is installed there too, as you will compile on either the master machine or a dedicated login server.

-- Reuti

> Many thanks Reuti
>
> Derrick
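[Editor's note: to make the `mpiexec hostname` suggestion concrete, a minimal test jobscript could look like the sketch below. The PE name "orte" and the slot count are taken from the queue/PE configuration quoted later in this thread; any tight-integration PE would do.]

#!/bin/sh
#$ -pe orte 4
#$ -cwd
# One hostname per granted slot indicates the SGE launcher (qrsh) works:
mpiexec hostname

Submitted with `qsub test.sh`, the job's output file should list the execution hosts; an orte_plm_base_select error here points at launcher detection rather than at the application.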
Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)
I'm no SGE expert, but I do note that your original error indicates that mpirun was unable to find a launcher for your environment. When running under SGE, mpirun looks for certain environment variables indicative of SGE. If it finds those, it then looks for the "qrsh" command. If it doesn't find "qrsh", or it isn't executable by the user, then you will fail with that error.

Given that you have the envars, is "qrsh" in your path where mpirun is executing? If not, then that is the reason why you are able to run outside of SGE (where mpirun will default to using ssh) but not inside it.

On Apr 16, 2011, at 5:21 PM, Derrick LIN wrote:

>> Well, does `mpiexec` point to the correct one?
>
> I don't really get this. I only installed one and only one Open MPI on the node. There shouldn't be another `mpiexec` on the system.
>
> It's worth mentioning that every node is deployed from a master image, so everything is exactly the same except the IP and DNS name.
>
>> I thought you compiled it on your own with --with-sge. What about:
>
> pwbcad@sgeqexec01:~$ ompi_info | grep grid
>                  MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.1)
>
> Is there any location where I can find a more meaningful Open MPI log?
>
> I will try to install Open MPI 1.4.3 and see if that works.
>
> I want to confirm one more thing: does SGE's master host need to have Open MPI installed? Is it relevant?
>
> Many thanks Reuti
>
> Derrick
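[Editor's note: a quick way to test this from inside a job is to print the SGE markers and look for qrsh explicitly. A minimal sketch, assuming the standard SGE variables SGE_ROOT, JOB_ID and PE_HOSTFILE; the exact set mpirun checks may differ between Open MPI versions.]

#!/bin/sh
#$ -pe orte 2
# Print the SGE markers mpirun would detect:
echo "SGE_ROOT=$SGE_ROOT JOB_ID=$JOB_ID PE_HOSTFILE=$PE_HOSTFILE"
# mpirun also needs an executable qrsh for the SGE launcher:
which qrsh || echo "qrsh not found in PATH"

If qrsh turns out to be missing, prepending the SGE binary directory (typically under $SGE_ROOT/bin) to PATH in the jobscript is the usual fix.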
Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)
On 16 Apr 2011, at 23:09, Derrick LIN wrote:

>> So you route the SGE startup mechanism to use `ssh`; nevertheless it should work, of course. A small difference from a conventional `ssh` is that SGE will start a private daemon for each job on the nodes, listening on a random port.
>>
>> When you use only one host, forks will be created but no `ssh` call. Does your test use more than one node?
>
> I have tested with more than one node, but the error still happened.
>
>> You copied your SGE-aware version to all nodes at the same location? Are you getting the correct `mpiexec` and shared libraries in your jobscript?
>
> I installed it from the Ubuntu apt-get on each node, so Open MPI is in the standard location. In fact Ubuntu handles all dependencies very well, without worrying about PATH or LD_LIBRARY_PATH.

Well, does `mpiexec` point to the correct one? I thought you compiled it on your own with --with-sge. What about:

$ ompi_info | grep grid
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.3)

You have this on all nodes, and your binary was compiled with this version?

All the stuff below looks fine. You can even try to start "from scratch" with a private copy of Open MPI, which you install for example in $HOME/local/openmpi-1.4.3, and set the paths accordingly (a build sketch follows after the quoted output below).

-- Reuti

>> #!/bin/sh
>> which mpiexec
>> echo $LD_LIBRARY_PATH
>> ldd ompi_job
>>
>> Does it show the expected ones (ompi_job is the binary and ompi_job.sh the script) when submitted with a PE request?
>
> /usr/bin/mpiexec
> /usr/lib/openmpi/lib:/usr/lib/openmpi/lib/openmpi
> linux-vdso.so.1 => (0x7fff9b1ff000)
> libmpi.so.0 => /usr/lib/libmpi.so.0 (0x2af0868aa000)
> libopen-rte.so.0 => /usr/lib/libopen-rte.so.0 (0x2af086b58000)
> libopen-pal.so.0 => /usr/lib/libopen-pal.so.0 (0x2af086da4000)
> libdl.so.2 => /lib/libdl.so.2 (0x2af087017000)
> libnsl.so.1 => /lib/libnsl.so.1 (0x2af08721b000)
> libutil.so.1 => /lib/libutil.so.1 (0x2af087436000)
> libm.so.6 => /lib/libm.so.6 (0x2af087639000)
> libpthread.so.0 => /lib/libpthread.so.0 (0x2af0878bc000)
> libc.so.6 => /lib/libc.so.6 (0x2af087ada000)
> /lib64/ld-linux-x86-64.so.2 (0x2af086687000)
>
> Below is some runtime data inside a job spooling directory on the execution host:
>
> pwbcad@sgeqexec01:128.1$ ls
> addgrpid  config  environment  error  exit_status  job_pid  pe_hostfile  pid  trace  usage
>
> pwbcad@sgeqexec01:128.1$ cat config
> add_grp_id=65416
> fs_stdin_host=""
> fs_stdin_path=
> fs_stdin_tmp_path=/tmp/128.1.dev.q/
> fs_stdin_file_staging=0
> fs_stdout_host=""
> fs_stdout_path=
> fs_stdout_tmp_path=/tmp/128.1.dev.q/
> fs_stdout_file_staging=0
> fs_stderr_host=""
> fs_stderr_path=
> fs_stderr_tmp_path=/tmp/128.1.dev.q/
> fs_stderr_file_staging=0
> stdout_path=/mnt/FacilityBioinformatics/pwbcad
> stderr_path=/mnt/FacilityBioinformatics/pwbcad
> stdin_path=/dev/null
> merge_stderr=1
> tmpdir=/tmp/128.1.dev.q
> handle_as_binary=0
> no_shell=0
> ckpt_job=0
> h_vmem=INFINITY
> h_vmem_is_consumable_job=0
> s_vmem=INFINITY
> s_vmem_is_consumable_job=0
> h_cpu=INFINITY
> h_cpu_is_consumable_job=0
> s_cpu=INFINITY
> s_cpu_is_consumable_job=0
> h_stack=INFINITY
> h_stack_is_consumable_job=0
> s_stack=INFINITY
> s_stack_is_consumable_job=0
> h_data=INFINITY
> h_data_is_consumable_job=0
> s_data=INFINITY
> s_data_is_consumable_job=0
> h_core=INFINITY
> s_core=INFINITY
> h_rss=INFINITY
> s_rss=INFINITY
> h_fsize=INFINITY
> s_fsize=INFINITY
> s_descriptors=UNDEFINED
> h_descriptors=UNDEFINED
> s_maxproc=UNDEFINED
> h_maxproc=UNDEFINED
> s_memorylocked=UNDEFINED
> h_memorylocked=UNDEFINED
> s_locks=UNDEFINED
> h_locks=UNDEFINED
> priority=0
> shell_path=/bin/bash
> script_file=/var/spool/gridengine/execd/sgeqexec01/job_scripts/128
> job_owner=pwbcad
> min_gid=0
> min_uid=0
> cwd=/mnt/FacilityBioinformatics/pwbcad
> prolog=none
> epilog=none
> starter_method=NONE
> suspend_method=NONE
> resume_method=NONE
> terminate_method=NONE
> script_timeout=120
> pe=orte
> pe_slots=16
> host_slots=8
> pe_hostfile=/var/spool/gridengine/execd/sgeqexec01/active_jobs/128.1/pe_hostfile
> pe_start=/bin/true
> pe_stop=/bin/true
> pe_stdout_path=/mnt/FacilityBioinformatics/pwbcad
> pe_stderr_path=/mnt/FacilityBioinformatics/pwbcad
> shell_start_mode=posix_compliant
> use_login_shell=1
> mail_list=pwb...@enzo.garvan.unsw.edu.au
> mail_options=0
> forbid_reschedule=0
> forbid_apperror=0
> queue=dev.q
> host=sgeqexec01.garvan.unsw.edu.au
> processors=UNDEFINED
> binding=NULL
> job_name=run_cal_pi_auto
> job_id=128
> ja_task_id=0
> account=sge
> submission_time=1302987873
> notify=0
> acct_project=none
> njob_args=0
> queue_tmpdir=/tmp
> use_afs=0
> admin_user=sgeadmin
> notify_kill_type=1
> notify_kill=default
> notify_susp_type=1
> notify_susp=default
> qsub_gid=no
> pty=0
> write_osjob_id=1
> inherit_env=1
> enable_windomacc=0
> enable_addgrp_kill=0
> csp=0
> ignore_fqdn=0
> default_domain=none
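[Editor's note: for the "from scratch" test Reuti mentions, the usual Open MPI build recipe would be along the lines of the sketch below. The tarball name and paths are illustrative; --with-sge is the configure flag discussed in this thread.]

$ tar xjf openmpi-1.4.3.tar.bz2
$ cd openmpi-1.4.3
$ ./configure --prefix=$HOME/local/openmpi-1.4.3 --with-sge
$ make all install

# then, in the jobscript (the prefix must be reachable on every node,
# e.g. via a shared $HOME):
export PATH=$HOME/local/openmpi-1.4.3/bin:$PATH
export LD_LIBRARY_PATH=$HOME/local/openmpi-1.4.3/lib:$LD_LIBRARY_PATH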
Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)
> So you route the SGE startup mechanism to use `ssh`; nevertheless it should work, of course. A small difference from a conventional `ssh` is that SGE will start a private daemon for each job on the nodes, listening on a random port.
>
> When you use only one host, forks will be created but no `ssh` call. Does your test use more than one node?

I have tested with more than one node, but the error still happened.

> You copied your SGE-aware version to all nodes at the same location? Are you getting the correct `mpiexec` and shared libraries in your jobscript? Does the output of:

I installed it from the Ubuntu apt-get on each node, so Open MPI is in the standard location. In fact Ubuntu handles all dependencies very well, without worrying about PATH or LD_LIBRARY_PATH.

> #!/bin/sh
> which mpiexec
> echo $LD_LIBRARY_PATH
> ldd ompi_job
>
> show the expected ones (ompi_job is the binary and ompi_job.sh the script) when submitted with a PE request?

/usr/bin/mpiexec
/usr/lib/openmpi/lib:/usr/lib/openmpi/lib/openmpi
linux-vdso.so.1 => (0x7fff9b1ff000)
libmpi.so.0 => /usr/lib/libmpi.so.0 (0x2af0868aa000)
libopen-rte.so.0 => /usr/lib/libopen-rte.so.0 (0x2af086b58000)
libopen-pal.so.0 => /usr/lib/libopen-pal.so.0 (0x2af086da4000)
libdl.so.2 => /lib/libdl.so.2 (0x2af087017000)
libnsl.so.1 => /lib/libnsl.so.1 (0x2af08721b000)
libutil.so.1 => /lib/libutil.so.1 (0x2af087436000)
libm.so.6 => /lib/libm.so.6 (0x2af087639000)
libpthread.so.0 => /lib/libpthread.so.0 (0x2af0878bc000)
libc.so.6 => /lib/libc.so.6 (0x2af087ada000)
/lib64/ld-linux-x86-64.so.2 (0x2af086687000)

Below is some runtime data inside a job spooling directory on the execution host:

pwbcad@sgeqexec01:128.1$ ls
addgrpid  config  environment  error  exit_status  job_pid  pe_hostfile  pid  trace  usage

pwbcad@sgeqexec01:128.1$ cat config
add_grp_id=65416
fs_stdin_host=""
fs_stdin_path=
fs_stdin_tmp_path=/tmp/128.1.dev.q/
fs_stdin_file_staging=0
fs_stdout_host=""
fs_stdout_path=
fs_stdout_tmp_path=/tmp/128.1.dev.q/
fs_stdout_file_staging=0
fs_stderr_host=""
fs_stderr_path=
fs_stderr_tmp_path=/tmp/128.1.dev.q/
fs_stderr_file_staging=0
stdout_path=/mnt/FacilityBioinformatics/pwbcad
stderr_path=/mnt/FacilityBioinformatics/pwbcad
stdin_path=/dev/null
merge_stderr=1
tmpdir=/tmp/128.1.dev.q
handle_as_binary=0
no_shell=0
ckpt_job=0
h_vmem=INFINITY
h_vmem_is_consumable_job=0
s_vmem=INFINITY
s_vmem_is_consumable_job=0
h_cpu=INFINITY
h_cpu_is_consumable_job=0
s_cpu=INFINITY
s_cpu_is_consumable_job=0
h_stack=INFINITY
h_stack_is_consumable_job=0
s_stack=INFINITY
s_stack_is_consumable_job=0
h_data=INFINITY
h_data_is_consumable_job=0
s_data=INFINITY
s_data_is_consumable_job=0
h_core=INFINITY
s_core=INFINITY
h_rss=INFINITY
s_rss=INFINITY
h_fsize=INFINITY
s_fsize=INFINITY
s_descriptors=UNDEFINED
h_descriptors=UNDEFINED
s_maxproc=UNDEFINED
h_maxproc=UNDEFINED
s_memorylocked=UNDEFINED
h_memorylocked=UNDEFINED
s_locks=UNDEFINED
h_locks=UNDEFINED
priority=0
shell_path=/bin/bash
script_file=/var/spool/gridengine/execd/sgeqexec01/job_scripts/128
job_owner=pwbcad
min_gid=0
min_uid=0
cwd=/mnt/FacilityBioinformatics/pwbcad
prolog=none
epilog=none
starter_method=NONE
suspend_method=NONE
resume_method=NONE
terminate_method=NONE
script_timeout=120
pe=orte
pe_slots=16
host_slots=8
pe_hostfile=/var/spool/gridengine/execd/sgeqexec01/active_jobs/128.1/pe_hostfile
pe_start=/bin/true
pe_stop=/bin/true
pe_stdout_path=/mnt/FacilityBioinformatics/pwbcad
pe_stderr_path=/mnt/FacilityBioinformatics/pwbcad
shell_start_mode=posix_compliant
use_login_shell=1
mail_list=pwb...@enzo.garvan.unsw.edu.au
mail_options=0
forbid_reschedule=0
forbid_apperror=0
queue=dev.q
host=sgeqexec01.garvan.unsw.edu.au
processors=UNDEFINED
binding=NULL
job_name=run_cal_pi_auto
job_id=128
ja_task_id=0
account=sge
submission_time=1302987873
notify=0
acct_project=none
njob_args=0
queue_tmpdir=/tmp
use_afs=0
admin_user=sgeadmin
notify_kill_type=1
notify_kill=default
notify_susp_type=1
notify_susp=default
qsub_gid=no
pty=0
write_osjob_id=1
inherit_env=1
enable_windomacc=0
enable_addgrp_kill=0
csp=0
ignore_fqdn=0
default_domain=none

pwbcad@sgeqexec01:128.1$ cat environment
USER=pwbcad
SSH_CLIENT=149.171.200.64 63056 22
MAIL=/var/mail/pwbcad
SHLVL=1
OLDPWD=/home/pwbcad
HOME=/home/pwbcad
SSH_TTY=/dev/pts/4
PAGER=less
PS1=\[\e[32;1m\]\u\[\e[0m\]@\[\e[35;1m\]\h\[\e[0m\]:\[\e[34;1m\]\W\[\e[0m\]\$
LOGNAME=pwbcad
_=/usr/bin/qsub
TERM=xterm
SGE_ROOT=/var/lib/gridengine
PATH=/tmp/128.1.dev.q:.:/home/pwbcad/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/meme/bin:/usr/local/eigenstrat:/usr/local/tophat/bin:/usr/local/cufflinks/bin:/usr/local/defuse/bin:/usr/local/bowtie/bin:/usr/local/cnvseq/bin:/usr/local/fastx_toolkit/bin:/usr/local/breakway/bin
SGE_CELL=default
LANG=en_AU.UTF-8
SHELL=/bin/bash
PWD=/mnt/FacilityBioinformatics/pwbcad
SSH_CONNECTION=1
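[Editor's note: since the config above points at a pe_hostfile, it can be worth dumping the granted allocation from inside the job. A minimal sketch; $PE_HOSTFILE is the standard SGE variable naming that file, and the line format follows sge_pe(5).]

#!/bin/sh
#$ -pe orte 4
# Each line of the file: hostname  slots  queue  processor-range
cat $PE_HOSTFILE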
Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)
On 15 Apr 2011, at 23:02, Derrick LIN wrote:

>> - what is your SGE configuration `qconf -sconf`?
>
> rlogin_daemon    /usr/sbin/sshd -i
> rlogin_command   /usr/bin/ssh
> qlogin_daemon    /usr/sbin/sshd -i
> qlogin_command   /usr/share/gridengine/qlogin-wrapper
> rsh_daemon       /usr/sbin/sshd -i
> rsh_command      /usr/bin/ssh

So you route the SGE startup mechanism to use `ssh`; nevertheless it should work, of course. A small difference from a conventional `ssh` is that SGE will start a private daemon for each job on the nodes, listening on a random port.

When you use only one host, forks will be created but no `ssh` call. Does your test use more than one node?

You copied your SGE-aware version to all nodes at the same location? Are you getting the correct `mpiexec` and shared libraries in your jobscript? Does the output of:

#!/bin/sh
which mpiexec
echo $LD_LIBRARY_PATH
ldd ompi_job

show the expected ones (ompi_job is the binary and ompi_job.sh the script) when submitted with a PE request?

-- Reuti

> jsv_url          none
> jsv_allowed_mod  ac,h,i,e,o,j,M,N,p,w
>
> # my queue setting is:
>
> qname              dev.q
> hostlist           sgeqexec01.domain.com.au
> seq_no             0
> load_thresholds    np_load_avg=1.75
> suspend_thresholds NONE
> nsuspend           1
> suspend_interval   00:05:00
> priority           0
> min_cpu_interval   00:05:00
> processors         UNDEFINED
> qtype              BATCH INTERACTIVE
> ckpt_list          NONE
> pe_list            make orte
> rerun              FALSE
> slots              8
> tmpdir             /tmp
> shell              /bin/bash
> prolog             NONE
> epilog             NONE
> shell_start_mode   posix_compliant
> starter_method     NONE
> suspend_method     NONE
> resume_method      NONE
> terminate_method   NONE
> notify             00:00:60
> owner_list         NONE
> user_lists         NONE
> xuser_lists        NONE
> subordinate_list   NONE
> complex_values     NONE
> projects           NONE
> xprojects          NONE
> calendar           NONE
> initial_state      default
> s_rt               INFINITY
> h_rt               INFINITY
> s_cpu              INFINITY
> h_cpu              INFINITY
> s_fsize            INFINITY
> h_fsize            INFINITY
> s_data             INFINITY
> h_data             INFINITY
> s_stack            INFINITY
> h_stack            INFINITY
> s_core             INFINITY
> h_core             INFINITY
> s_rss              INFINITY
> h_rss              INFINITY
> s_vmem             INFINITY
> h_vmem             INFINITY
>
> # my PE setting is:
>
> pe_name            orte
> slots              4
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $round_robin
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
>
>> a) you are testing from master to a node, but jobs are running between nodes.
>>
>> b) unless you need X11 forwarding, using SGE's -builtin- communication works fine; this way you can have a cluster without `rsh` or `ssh` (or with them limited to admin staff) and can still run parallel jobs.
>
> Sorry for the misleading snip. All the hosts (both the master and the execution hosts) in the cluster can ssh to each other password-lessly without an issue. As my point 2) states, I could run a generic Open MPI job without SGE successfully, so I do not think it is a communication issue.
>
>> Then you are bypassing SGE's slot allocation and will have wrong accounting and no job control of the slave tasks.
>
> I know it's not a proper submission as a PE job. I simply ran out of ideas about what to do next. Even though it's not the proper way, that Open MPI error didn't happen and the job completed. I am wondering why.
>
> The correct version of my Open MPI is 1.4.1, not 1.3 as in my first post.
>
> I have installed Open MPI on the submission host and the master later, but it didn't help. So I guess Open MPI is needed on the execution hosts only.
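[Editor's note: with the queue and PE above in place, a proper tight-integration submission would look roughly like the sketch below. The script and binary names follow the ompi_job examples in this thread; with SGE support compiled in, mpiexec takes the process count from the PE allocation, so no -np or hostfile is needed.]

#!/bin/sh
#$ -pe orte 4
#$ -cwd
mpiexec ./ompi_job    # slot count and hosts come from SGE's pe_hostfile

$ qsub ompi_job.sh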
Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)
>> - what is your SGE configuration `qconf -sconf`?

#global:
execd_spool_dir            /var/spool/gridengine/execd
mailer                     /usr/bin/mail
xterm                      /usr/bin/xterm
load_sensor                none
prolog                     none
epilog                     none
shell_start_mode           posix_compliant
login_shells               bash,sh,ksh,csh,tcsh
min_uid                    0
min_gid                    0
user_lists                 none
xuser_lists                none
projects                   none
xprojects                  none
enforce_project            false
enforce_user               auto
load_report_time           00:00:40
max_unheard                00:05:00
reschedule_unknown         00:00:00
loglevel                   log_warning
administrator_mail         root
set_token_cmd              none
pag_cmd                    none
token_extend_time          none
shepherd_cmd               none
qmaster_params             none
execd_params               none
reporting_params           accounting=true reporting=false \
                           flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs              100
gid_range                  65400-65500
max_aj_instances           2000
max_aj_tasks               75000
max_u_jobs                 0
max_jobs                   0
auto_user_oticket          0
auto_user_fshare           0
auto_user_default_project  none
auto_user_delete_time      86400
delegated_file_staging     false
reprioritize               false
rlogin_daemon              /usr/sbin/sshd -i
rlogin_command             /usr/bin/ssh
qlogin_daemon              /usr/sbin/sshd -i
qlogin_command             /usr/share/gridengine/qlogin-wrapper
rsh_daemon                 /usr/sbin/sshd -i
rsh_command                /usr/bin/ssh
jsv_url                    none
jsv_allowed_mod            ac,h,i,e,o,j,M,N,p,w

# my queue setting is:

qname              dev.q
hostlist           sgeqexec01.domain.com.au
seq_no             0
load_thresholds    np_load_avg=1.75
suspend_thresholds NONE
nsuspend           1
suspend_interval   00:05:00
priority           0
min_cpu_interval   00:05:00
processors         UNDEFINED
qtype              BATCH INTERACTIVE
ckpt_list          NONE
pe_list            make orte
rerun              FALSE
slots              8
tmpdir             /tmp
shell              /bin/bash
prolog             NONE
epilog             NONE
shell_start_mode   posix_compliant
starter_method     NONE
suspend_method     NONE
resume_method      NONE
terminate_method   NONE
notify             00:00:60
owner_list         NONE
user_lists         NONE
xuser_lists        NONE
subordinate_list   NONE
complex_values     NONE
projects           NONE
xprojects          NONE
calendar           NONE
initial_state      default
s_rt               INFINITY
h_rt               INFINITY
s_cpu              INFINITY
h_cpu              INFINITY
s_fsize            INFINITY
h_fsize            INFINITY
s_data             INFINITY
h_data             INFINITY
s_stack            INFINITY
h_stack            INFINITY
s_core             INFINITY
h_core             INFINITY
s_rss              INFINITY
h_rss              INFINITY
s_vmem             INFINITY
h_vmem             INFINITY

# my PE setting is:

pe_name            orte
slots              4
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

>> a) you are testing from master to a node, but jobs are running between nodes.
>>
>> b) unless you need X11 forwarding, using SGE's -builtin- communication works fine; this way you can have a cluster without `rsh` or `ssh` (or with them limited to admin staff) and can still run parallel jobs.

Sorry for the misleading snip. All the hosts (both the master and the execution hosts) in the cluster can ssh to each other password-lessly without an issue. As my point 2) states, I could run a generic Open MPI job without SGE successfully, so I do not think it is a communication issue.

>> Then you are bypassing SGE's slot allocation and will have wrong accounting and no job control of the slave tasks.

I know it's not a proper submission as a PE job. I simply ran out of ideas about what to do next. Even though it's not the proper way, that Open MPI error didn't happen and the job completed. I am wondering why.

The correct version of my Open MPI is 1.4.1, not 1.3 as in my first post.

I have installed Open MPI on the submission host and the master later, but it didn't help. So I guess Open MPI is needed on the execution hosts only.
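[Editor's note: for contrast, the "generic Open MPI job without SGE" that Derrick describes would be run along these lines; this is exactly what bypasses SGE's slot allocation. The hostfile syntax is standard Open MPI; sgeqexec02 is a hypothetical second node.]

$ cat hosts
sgeqexec01 slots=8
sgeqexec02 slots=8
$ mpirun -np 16 --hostfile hosts ./ompi_job

Here mpirun falls back to plain ssh for launching, which is why this works even when the SGE launcher (qrsh) cannot be found, but SGE then has no accounting for, or control over, the slave processes.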