Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)
> Well, does `mpiexec` point to the correct one?

I don't really get this. I installed one and only one OpenMPI on the node; there shouldn't be another `mpiexec` on the system. It's worth mentioning that every node is deployed from a master image, so everything is exactly the same except the IP and DNS name.

> I thought you compiled it on your own with --with-sge.

What about:

pwbcad@sgeqexec01:~$ ompi_info | grep grid
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.1)

Is there any location where I can find a more meaningful OpenMPI log? I will try to install OpenMPI 1.4.3 and see if that works.

I want to confirm one more thing: does SGE's master host need to have OpenMPI installed? Is it relevant?

Many thanks Reuti

Derrick
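As an aside, the `ompi_info | grep grid` check above can be wrapped in a small script. This is only a sketch: the sample line below is the 1.4.1 output quoted in this message, standing in for live `ompi_info` output on a real node.

```shell
# Sketch: verify that this Open MPI build has SGE (gridengine) support.
# On a real node you would pipe `ompi_info` itself; here the sample line
# quoted in the thread stands in for that output.
sample='MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.1)'
if printf '%s\n' "$sample" | grep -q 'gridengine'; then
    echo "gridengine component present"
else
    echo "gridengine component missing - rebuild with --with-sge"
fi
```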
Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)
> > So you route the SGE startup mechanism to use `ssh`; nevertheless it
> > should work of course. A small difference to a conventional `ssh` is that
> > SGE will start a private daemon for each job on the nodes, listening on a
> > random port.
> >
> > When you use only one host, then forks will be created but no `ssh` call.
> > Your test uses more than one node?

I have tested with more than one node, but the error still happened.

> You copied your SGE-aware version to all nodes at the same location? Are you
> getting the correct `mpiexec` and shared libraries in your jobscript?

I installed it from the Ubuntu apt-get on each node, so the OpenMPI is in the standard location. In fact Ubuntu handles all dependencies very well, without worrying about PATH or LD_LIBRARY_PATH.

> Does the output of:
>
> #!/bin/sh
> which mpiexec
> echo $LD_LIBRARY_PATH
> ldd ompi_job
>
> show the expected ones (ompi_job is the binary and ompi_job.sh the script)
> when submitted with a PE request?

/usr/bin/mpiexec
/usr/lib/openmpi/lib:/usr/lib/openmpi/lib/openmpi
        linux-vdso.so.1 =>  (0x7fff9b1ff000)
        libmpi.so.0 => /usr/lib/libmpi.so.0 (0x2af0868aa000)
        libopen-rte.so.0 => /usr/lib/libopen-rte.so.0 (0x2af086b58000)
        libopen-pal.so.0 => /usr/lib/libopen-pal.so.0 (0x2af086da4000)
        libdl.so.2 => /lib/libdl.so.2 (0x2af087017000)
        libnsl.so.1 => /lib/libnsl.so.1 (0x2af08721b000)
        libutil.so.1 => /lib/libutil.so.1 (0x2af087436000)
        libm.so.6 => /lib/libm.so.6 (0x2af087639000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x2af0878bc000)
        libc.so.6 => /lib/libc.so.6 (0x2af087ada000)
        /lib64/ld-linux-x86-64.so.2 (0x2af086687000)

Below are some runtime data inside the job spooling directory on the execution host:

pwbcad@sgeqexec01:128.1$ ls
addgrpid  config  environment  error  exit_status  job_pid  pe_hostfile  pid  trace  usage

pwbcad@sgeqexec01:128.1$ cat config
add_grp_id=65416
fs_stdin_host=""
fs_stdin_path=
fs_stdin_tmp_path=/tmp/128.1.dev.q/
fs_stdin_file_staging=0
fs_stdout_host=""
fs_stdout_path=
fs_stdout_tmp_path=/tmp/128.1.dev.q/
fs_stdout_file_staging=0
fs_stderr_host=""
fs_stderr_path=
fs_stderr_tmp_path=/tmp/128.1.dev.q/
fs_stderr_file_staging=0
stdout_path=/mnt/FacilityBioinformatics/pwbcad
stderr_path=/mnt/FacilityBioinformatics/pwbcad
stdin_path=/dev/null
merge_stderr=1
tmpdir=/tmp/128.1.dev.q
handle_as_binary=0
no_shell=0
ckpt_job=0
h_vmem=INFINITY
h_vmem_is_consumable_job=0
s_vmem=INFINITY
s_vmem_is_consumable_job=0
h_cpu=INFINITY
h_cpu_is_consumable_job=0
s_cpu=INFINITY
s_cpu_is_consumable_job=0
h_stack=INFINITY
h_stack_is_consumable_job=0
s_stack=INFINITY
s_stack_is_consumable_job=0
h_data=INFINITY
h_data_is_consumable_job=0
s_data=INFINITY
s_data_is_consumable_job=0
h_core=INFINITY
s_core=INFINITY
h_rss=INFINITY
s_rss=INFINITY
h_fsize=INFINITY
s_fsize=INFINITY
s_descriptors=UNDEFINED
h_descriptors=UNDEFINED
s_maxproc=UNDEFINED
h_maxproc=UNDEFINED
s_memorylocked=UNDEFINED
h_memorylocked=UNDEFINED
s_locks=UNDEFINED
h_locks=UNDEFINED
priority=0
shell_path=/bin/bash
script_file=/var/spool/gridengine/execd/sgeqexec01/job_scripts/128
job_owner=pwbcad
min_gid=0
min_uid=0
cwd=/mnt/FacilityBioinformatics/pwbcad
prolog=none
epilog=none
starter_method=NONE
suspend_method=NONE
resume_method=NONE
terminate_method=NONE
script_timeout=120
pe=orte
pe_slots=16
host_slots=8
pe_hostfile=/var/spool/gridengine/execd/sgeqexec01/active_jobs/128.1/pe_hostfile
pe_start=/bin/true
pe_stop=/bin/true
pe_stdout_path=/mnt/FacilityBioinformatics/pwbcad
pe_stderr_path=/mnt/FacilityBioinformatics/pwbcad
shell_start_mode=posix_compliant
use_login_shell=1
mail_list=pwb...@enzo.garvan.unsw.edu.au
mail_options=0
forbid_reschedule=0
forbid_apperror=0
queue=dev.q
host=sgeqexec01.garvan.unsw.edu.au
processors=UNDEFINED
binding=NULL
job_name=run_cal_pi_auto
job_id=128
ja_task_id=0
account=sge
submission_time=1302987873
notify=0
acct_project=none
njob_args=0
queue_tmpdir=/tmp
use_afs=0
admin_user=sgeadmin
notify_kill_type=1
notify_kill=default
notify_susp_type=1
notify_susp=default
qsub_gid=no
pty=0
write_osjob_id=1
inherit_env=1
enable_windomacc=0
enable_addgrp_kill=0
csp=0
ignore_fqdn=0
default_domain=none

pwbcad@sgeqexec01:128.1$ cat environment
USER=pwbcad
SSH_CLIENT=149.171.200.64 63056 22
MAIL=/var/mail/pwbcad
SHLVL=1
OLDPWD=/home/pwbcad
HOME=/home/pwbcad
SSH_TTY=/dev/pts/4
PAGER=less
PS1=\[\e[32;1m\]\u\[\e[0m\]@\[\e[35;1m\]\h\[\e[0m\]:\[\e[34;1m\]\W\[\e[0m\]\$
LOGNAME=pwbcad
_=/usr/bin/qsub
TERM=xterm
SGE_ROOT=/var/lib/gridengine
PATH=/tmp/128.1.dev.q:.:/home/pwbcad/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/meme/bin:/usr/local/eigenstrat:/usr/local/tophat/bin:/usr/local/cufflinks/bin:/usr/local/defuse/bin:/usr/local/bowtie/bin:/usr/local/cnvseq/bin:/usr/local/fastx_toolkit/bin:/usr/local/breakway/bin
SGE_CELL=default
LANG=en_AU.UTF-8
SHELL=/bin/bash
PWD=/mnt/FacilityBioinformatics/pwbcad
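For what it's worth, the pe_hostfile listed in the spool directory above is what a tight integration hands to `mpirun`: one line per host with the granted slot count in the second column. A sketch of how a host list and total slot count fall out of it (the file contents below are illustrative; the second host, sgeqexec02, is hypothetical, but the four-column format is SGE's standard one):

```shell
# Sketch: derive a host list and total slot count from an SGE pe_hostfile.
# The contents are made up for illustration; sgeqexec02 is hypothetical.
pe_hostfile=$(mktemp)
cat > "$pe_hostfile" <<'EOF'
sgeqexec01.garvan.unsw.edu.au 8 dev.q@sgeqexec01.garvan.unsw.edu.au UNDEFINED
sgeqexec02.garvan.unsw.edu.au 8 dev.q@sgeqexec02.garvan.unsw.edu.au UNDEFINED
EOF
# Column 1 is the host name, column 2 the slots granted on that host.
hosts=$(awk '{print $1}' "$pe_hostfile" | paste -s -d, -)
slots=$(awk '{total += $2} END {print total}' "$pe_hostfile")
echo "hosts=$hosts"
echo "slots=$slots"
rm -f "$pe_hostfile"
```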
Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)
> - what is your SGE configuration `qconf -sconf`?

#global:
execd_spool_dir              /var/spool/gridengine/execd
mailer                       /usr/bin/mail
xterm                        /usr/bin/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 bash,sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           root
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false sharelog=00:00:00
finished_jobs                100
gid_range                    65400-65500
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
auto_user_oticket            0
auto_user_fshare             0
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 false
rlogin_daemon                /usr/sbin/sshd -i
rlogin_command               /usr/bin/ssh
qlogin_daemon                /usr/sbin/sshd -i
qlogin_command               /usr/share/gridengine/qlogin-wrapper
rsh_daemon                   /usr/sbin/sshd -i
rsh_command                  /usr/bin/ssh
jsv_url                      none
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w

# my queue setting is:

qname                 dev.q
hostlist              sgeqexec01.domain.com.au
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make orte
rerun                 FALSE
slots                 8
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

# my PE setting is:

pe_name              orte
slots                4
user_lists           NONE
xuser_lists          NONE
start_proc_args      /bin/true
stop_proc_args       /bin/true
allocation_rule      $round_robin
control_slaves       TRUE
job_is_first_task    FALSE
urgency_slots        min
accounting_summary   FALSE

> a) you are testing from master to a node, but jobs are running between
> nodes.
>
> b) unless you need X11 forwarding, using SGE's -builtin- communication
> works fine; this way you can have a cluster without `rsh` or `ssh` (or
> limited to admin staff) and can still run parallel jobs.

Sorry for the misleading snip. All the hosts (both master and execution hosts) in the cluster can ssh to each other passwordlessly without an issue. As my 2) states, I could run a generic OpenMPI job outside the SGE successfully, so I do not think it is a communication issue.

> Then you are bypassing SGE's slot allocation and will have wrong accounting
> and no job control of the slave tasks.

I know it's not a proper submission as a PE job; I simply ran out of ideas what to do next. Even though it's not the proper way, that OpenMPI error didn't happen and the job completed, and I am wondering why.

The correct version of my OpenMPI is 1.4.1, not 1.3 as in my first post. I installed OpenMPI on the submission host and the master later, but it didn't help. So I guess OpenMPI is needed on the execution hosts only.
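The wiring to cross-check in the configs above is that the PE appears in the queue's pe_list and that control_slaves is TRUE (required for tight integration). A sketch of that check; on a live system you would feed it `qconf -sq dev.q` and `qconf -sp orte` output, and here sample lines from the configs above stand in for it:

```shell
# Sketch: cross-check queue/PE wiring for tight SGE integration.
# Sample lines from the thread stand in for live qconf output.
queue_pe_list='pe_list               make orte'
pe_control='control_slaves       TRUE'
echo "$queue_pe_list" | grep -qw 'orte' && echo "PE orte attached to queue"
echo "$pe_control" | grep -qw 'TRUE' && echo "control_slaves enabled"
```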
[OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed)
Hi all,

I am trying to set up a small SGE cluster with OpenMPI integrated, but I am totally stuck when trying to run an OpenMPI job through the SGE's PE. I mainly followed the guide sge-snow.pdf from Revolutions Computing and http://idolinux.blogspot.com/2010/04/quick-install-of-open-mpi-with-grid.html

The cluster is entirely Ubuntu 10.10 based; both SGE 6.2u5 and OpenMPI 1.3 are directly from apt-get, except that OpenMPI is rebuilt from source with the --with-sge flag. Note: OpenMPI has been installed on all execution hosts, not on the queue master and submission host.

I submitted a job by:

qsub -pe orte 8 ./ompi_job.sh

The error I got looks like:

[sgeqexec01:06612] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line 161
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[sgeqexec01:06612] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../orte/runtime/orte_init.c at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[sgeqexec01:06612] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../../../orte/tools/orterun/orterun.c at line 541

For troubleshooting I have done several things:

1) Passwordless SSH has been configured properly for the execution hosts and the queue master.

pwbcad@sgeqmast01:~$ ssh sgeqexec01 uptime
 14:35:54 up 2:47, 1 user, load average: 0.10, 0.08, 0.02

2) I could run an OpenMPI job outside the SGE successfully:

mpirun -host n1,n2 -np 8 ./ompi_job

3) I submitted the job to a queue directly instead of a PE; the job could run and completed successfully:

qsub -q dev.q ./ompi_job.sh

4) Although I don't think PATH and LD_LIBRARY_PATH would cause issues on Ubuntu, I still added the OpenMPI binaries and libraries to both. It didn't help.

It will be very much appreciated if anyone can share their experience!

Derrick
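For completeness, a sketch of what the PE job script might contain. This is an assumption, not the poster's actual ompi_job.sh: under a tight SGE integration `mpiexec` takes the host list and slot count from SGE itself, so no -np or -host options are needed, and the paths follow the Ubuntu layout quoted later in the thread.

```shell
# Sketch: generate a minimal PE job script (hypothetical ompi_job.sh).
# Under tight SGE integration, mpiexec reads slots/hosts from SGE, so
# no -np or -host flags appear. Paths assume the Ubuntu packaging
# described in the thread.
cat > ompi_job.sh <<'EOF'
#!/bin/sh
export LD_LIBRARY_PATH=/usr/lib/openmpi/lib:$LD_LIBRARY_PATH
mpiexec ./ompi_job
EOF
chmod +x ompi_job.sh
# Submit with: qsub -pe orte 8 ./ompi_job.sh
```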