Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)

2011-04-16 Thread Derrick LIN
> Well, does `mpiexec` point to the correct one?

I don't really get this. I installed one and only one OpenMPI on the
node; there shouldn't be another 'mpiexec' on the system.

It's worth mentioning that every node is deployed from a master image, so
everything is exactly the same except the IP and DNS name.

> I thought you compiled it on your own with --with-sge. What about:
pwbcad@sgeqexec01:~$ ompi_info | grep grid
 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.1)

Is there any location where I can find a more meaningful OpenMPI log?
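In the meantime, one way to get more launch-time detail out of ORTE itself is to raise the MCA verbosity parameters. A sketch of a diagnostic jobscript (the `plm_base_verbose` and `ras_base_verbose` parameter names are assumed from the 1.4-era documentation; `ompi_job` is the binary from this thread):

```shell
# Write a jobscript that asks ORTE to log its launcher/allocator selection.
cat > debug_job.sh <<'EOF'
#!/bin/sh
which mpiexec
echo "$LD_LIBRARY_PATH"
mpirun --mca plm_base_verbose 10 --mca ras_base_verbose 10 ./ompi_job
EOF
chmod +x debug_job.sh
# On the cluster, submit it with:  qsub -pe orte 8 ./debug_job.sh
```

The verbose output should show which PLM component ORTE tried to select, which is exactly where the "orte_plm_base_select failed" error comes from.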

I will try installing OpenMPI 1.4.3 and see if that works.

I want to confirm one more thing: does SGE's master host need to have
OpenMPI installed? Is it relevant?

Many thanks Reuti

Derrick


Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)

2011-04-16 Thread Derrick LIN
>
> So you route the SGE startup mechanism to use `ssh`; nevertheless it
> should work, of course. A small difference from a conventional `ssh` is that
> SGE will start a private daemon for each job on the nodes, listening on a
> random port.
>
> When you use only one host, forks will be created but no `ssh` call.
> Does your test use more than one node?
>

I have tested with more than one node, but the error still happened.

> You copied your SGE-aware version to all nodes at the same location? Are you
> getting the correct `mpiexec` and shared libraries in your jobscript? Is
> the output of:
>

I installed it via apt-get on each node, so OpenMPI is in the standard
location. In fact, Ubuntu handles all the dependencies very well, without my
having to worry about PATH or LD_LIBRARY_PATH.


> #!/bin/sh
> which mpiexec
> echo $LD_LIBRARY_PATH
> ldd ompi_job
>
> the expected ones (ompi_job is the binary and ompi_job.sh the script) when
> submitted with a PE request?
>

/usr/bin/mpiexec
/usr/lib/openmpi/lib:/usr/lib/openmpi/lib/openmpi
linux-vdso.so.1 =>  (0x7fff9b1ff000)
libmpi.so.0 => /usr/lib/libmpi.so.0 (0x2af0868aa000)
libopen-rte.so.0 => /usr/lib/libopen-rte.so.0 (0x2af086b58000)
libopen-pal.so.0 => /usr/lib/libopen-pal.so.0 (0x2af086da4000)
libdl.so.2 => /lib/libdl.so.2 (0x2af087017000)
libnsl.so.1 => /lib/libnsl.so.1 (0x2af08721b000)
libutil.so.1 => /lib/libutil.so.1 (0x2af087436000)
libm.so.6 => /lib/libm.so.6 (0x2af087639000)
libpthread.so.0 => /lib/libpthread.so.0 (0x2af0878bc000)
libc.so.6 => /lib/libc.so.6 (0x2af087ada000)
/lib64/ld-linux-x86-64.so.2 (0x2af086687000)

Below are some runtime data inside a job spooling directory on the execution
host

pwbcad@sgeqexec01:128.1$ ls
addgrpid  config  environment  error  exit_status  job_pid  pe_hostfile  pid  trace  usage
pwbcad@sgeqexec01:128.1$ cat config
add_grp_id=65416
fs_stdin_host=""
fs_stdin_path=
fs_stdin_tmp_path=/tmp/128.1.dev.q/
fs_stdin_file_staging=0
fs_stdout_host=""
fs_stdout_path=
fs_stdout_tmp_path=/tmp/128.1.dev.q/
fs_stdout_file_staging=0
fs_stderr_host=""
fs_stderr_path=
fs_stderr_tmp_path=/tmp/128.1.dev.q/
fs_stderr_file_staging=0
stdout_path=/mnt/FacilityBioinformatics/pwbcad
stderr_path=/mnt/FacilityBioinformatics/pwbcad
stdin_path=/dev/null
merge_stderr=1
tmpdir=/tmp/128.1.dev.q
handle_as_binary=0
no_shell=0
ckpt_job=0
h_vmem=INFINITY
h_vmem_is_consumable_job=0
s_vmem=INFINITY
s_vmem_is_consumable_job=0
h_cpu=INFINITY
h_cpu_is_consumable_job=0
s_cpu=INFINITY
s_cpu_is_consumable_job=0
h_stack=INFINITY
h_stack_is_consumable_job=0
s_stack=INFINITY
s_stack_is_consumable_job=0
h_data=INFINITY
h_data_is_consumable_job=0
s_data=INFINITY
s_data_is_consumable_job=0
h_core=INFINITY
s_core=INFINITY
h_rss=INFINITY
s_rss=INFINITY
h_fsize=INFINITY
s_fsize=INFINITY
s_descriptors=UNDEFINED
h_descriptors=UNDEFINED
s_maxproc=UNDEFINED
h_maxproc=UNDEFINED
s_memorylocked=UNDEFINED
h_memorylocked=UNDEFINED
s_locks=UNDEFINED
h_locks=UNDEFINED
priority=0
shell_path=/bin/bash
script_file=/var/spool/gridengine/execd/sgeqexec01/job_scripts/128
job_owner=pwbcad
min_gid=0
min_uid=0
cwd=/mnt/FacilityBioinformatics/pwbcad
prolog=none
epilog=none
starter_method=NONE
suspend_method=NONE
resume_method=NONE
terminate_method=NONE
script_timeout=120
pe=orte
pe_slots=16
host_slots=8
pe_hostfile=/var/spool/gridengine/execd/sgeqexec01/active_jobs/128.1/pe_hostfile
pe_start=/bin/true
pe_stop=/bin/true
pe_stdout_path=/mnt/FacilityBioinformatics/pwbcad
pe_stderr_path=/mnt/FacilityBioinformatics/pwbcad
shell_start_mode=posix_compliant
use_login_shell=1
mail_list=pwb...@enzo.garvan.unsw.edu.au
mail_options=0
forbid_reschedule=0
forbid_apperror=0
queue=dev.q
host=sgeqexec01.garvan.unsw.edu.au
processors=UNDEFINED
binding=NULL
job_name=run_cal_pi_auto
job_id=128
ja_task_id=0
account=sge
submission_time=1302987873
notify=0
acct_project=none
njob_args=0
queue_tmpdir=/tmp
use_afs=0
admin_user=sgeadmin
notify_kill_type=1
notify_kill=default
notify_susp_type=1
notify_susp=default
qsub_gid=no
pty=0
write_osjob_id=1
inherit_env=1
enable_windomacc=0
enable_addgrp_kill=0
csp=0
ignore_fqdn=0
default_domain=none
pwbcad@sgeqexec01:128.1$ cat environment
USER=pwbcad
SSH_CLIENT=149.171.200.64 63056 22
MAIL=/var/mail/pwbcad
SHLVL=1
OLDPWD=/home/pwbcad
HOME=/home/pwbcad
SSH_TTY=/dev/pts/4
PAGER=less
PS1=\[\e[32;1m\]\u\[\e[0m\]@\[\e[35;1m\]\h\[\e[0m\]:\[\e[34;1m\]\W\[\e[0m\]\$
LOGNAME=pwbcad
_=/usr/bin/qsub
TERM=xterm
SGE_ROOT=/var/lib/gridengine
PATH=/tmp/128.1.dev.q:.:/home/pwbcad/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/meme/bin:/usr/local/eigenstrat:/usr/local/tophat/bin:/usr/local/cufflinks/bin:/usr/local/defuse/bin:/usr/local/bowtie/bin:/usr/local/cnvseq/bin:/usr/local/fastx_toolkit/bin:/usr/local/breakway/bin
SGE_CELL=default
LANG=en_AU.UTF-8
SHELL=/bin/bash
PWD=/mnt/FacilityBioinformatics/pwbcad
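For reference, the `pe_hostfile` the config above points at is what OpenMPI's gridengine RAS component reads to learn its allocation. A sketch of the format and its plain-hostfile equivalent (the second host below is made up for illustration):

```shell
# Each pe_hostfile line is: <host> <slots> <queue@host> <processor range>
cat > pe_hostfile <<'EOF'
sgeqexec01.garvan.unsw.edu.au 8 dev.q@sgeqexec01.garvan.unsw.edu.au UNDEFINED
sgeqexec02.garvan.unsw.edu.au 8 dev.q@sgeqexec02.garvan.unsw.edu.au UNDEFINED
EOF
# The equivalent plain OpenMPI hostfile, were one launching outside SGE:
awk '{ printf "%s slots=%s\n", $1, $2 }' pe_hostfile > hostfile
cat hostfile
```

Under tight integration mpirun finds this file via the environment SGE exports to the job, so nothing needs to be passed on the command line.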

Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)

2011-04-15 Thread Derrick LIN
>
> - what is your SGE configuration `qconf -sconf`?


#global:
execd_spool_dir  /var/spool/gridengine/execd
mailer   /usr/bin/mail
xterm/usr/bin/xterm
load_sensor  none
prolog   none
epilog   none
shell_start_mode posix_compliant
login_shells bash,sh,ksh,csh,tcsh
min_uid  0
min_gid  0
user_lists   none
xuser_lists  none
projects none
xprojectsnone
enforce_project  false
enforce_user auto
load_report_time 00:00:40
max_unheard  00:05:00
reschedule_unknown   00:00:00
loglevel log_warning
administrator_mail   root
set_token_cmdnone
pag_cmd  none
token_extend_timenone
shepherd_cmd none
qmaster_params   none
execd_params none
reporting_params accounting=true reporting=false \
 flush_time=00:00:15 joblog=false \
 sharelog=00:00:00
finished_jobs100
gid_range65400-65500
max_aj_instances 2000
max_aj_tasks 75000
max_u_jobs   0
max_jobs 0
auto_user_oticket0
auto_user_fshare 0
auto_user_default_projectnone
auto_user_delete_time86400
delegated_file_staging   false
reprioritize false
rlogin_daemon/usr/sbin/sshd -i
rlogin_command   /usr/bin/ssh
qlogin_daemon/usr/sbin/sshd -i
qlogin_command   /usr/share/gridengine/qlogin-wrapper
rsh_daemon   /usr/sbin/sshd -i
rsh_command  /usr/bin/ssh
jsv_url  none
jsv_allowed_mod  ac,h,i,e,o,j,M,N,p,w

# my queue setting is:

qname dev.q
hostlist  sgeqexec01.domain.com.au
seq_no0
load_thresholds   np_load_avg=1.75
suspend_thresholdsNONE
nsuspend  1
suspend_interval  00:05:00
priority  0
min_cpu_interval  00:05:00
processorsUNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list   make orte
rerun FALSE
slots 8
tmpdir/tmp
shell /bin/bash
prologNONE
epilogNONE
shell_start_mode  posix_compliant
starter_methodNONE
suspend_methodNONE
resume_method NONE
terminate_method  NONE
notify00:00:60
owner_listNONE
user_listsNONE
xuser_lists   NONE
subordinate_list  NONE
complex_valuesNONE
projects  NONE
xprojects NONE
calendar  NONE
initial_state default
s_rt  INFINITY
h_rt  INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize   INFINITY
h_fsize   INFINITY
s_dataINFINITY
h_dataINFINITY
s_stack   INFINITY
h_stack   INFINITY
s_coreINFINITY
h_coreINFINITY
s_rss INFINITY
h_rss INFINITY
s_vmemINFINITY
h_vmemINFINITY

# my PE setting is:

pe_nameorte
slots  4
user_lists NONE
xuser_listsNONE
start_proc_args/bin/true
stop_proc_args /bin/true
allocation_rule$round_robin
control_slaves TRUE
job_is_first_task  FALSE
urgency_slots  min
accounting_summary FALSE
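For anyone reproducing this setup, the PE above can be recreated from a definition file (a sketch; `qconf -Ap` adds a PE from a file, and attaching it to the queue is left as a comment since it needs a live qmaster):

```shell
# PE definition matching the settings shown above.
cat > orte.pe <<'EOF'
pe_name            orte
slots              4
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
EOF
# On a live cluster:
#   qconf -Ap orte.pe                                # add the PE
#   qconf -mattr queue pe_list "make orte" dev.q     # attach it to dev.q
```

`control_slaves TRUE` is the important line for OpenMPI: it allows the slave tasks to be started under SGE's control via qrsh, which is what tight integration relies on.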


> a) you are testing from master to a node, but jobs are running between
> nodes.


> b) unless you need X11 forwarding, using SGE’s -builtin- communication
> works fine, this way you can have a cluster without `rsh` or `ssh` (or
> limited to admin staff) and can still run parallel jobs.
>

Sorry for the misleading snip. All the hosts (both the master and the
execution hosts) in the cluster can SSH to each other passwordlessly without
an issue. As my point 2) states, I could run a generic OpenMPI job outside SGE
successfully, so I do not think it is a communication issue.


> Then you are bypassing SGE’s slot allocation and will have wrong accounting
> and no job control of the slave tasks.
>

I know it's not a proper submission as a PE job; I had simply run out of ideas
about what to do next. Even though it's not the proper way, the OpenMPI error
didn't happen and the job completed. I am wondering why.


The correct version of my OpenMPI is 1.4.1, not 1.3 in my first post.

I later installed OpenMPI on the submission host and the master as well, but it
didn't help. So I guess OpenMPI is needed on the execution hosts only.


[OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed)

2011-04-15 Thread Derrick LIN
Hi all,

I am trying to set up a small SGE cluster with OpenMPI integrated, but I am
totally stuck when trying to run an OpenMPI job through SGE's PE.

I mainly followed the guide sge-snow.pdf from Revolutions Computing and
http://idolinux.blogspot.com/2010/04/quick-install-of-open-mpi-with-grid.html

The cluster is entirely Ubuntu 10.10 based; both SGE 6.2u5 and OpenMPI 1.3
are directly from apt-get, except that OpenMPI was rebuilt from source with
the --with-sge flag.

Note: OpenMPI has been installed on all execution hosts, but not on the queue
master or the submission host.

I submitted a job with

qsub -pe orte 8 ./ompi_job.sh
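(The actual `ompi_job.sh` isn't shown in this thread; a minimal sketch of what such a jobscript typically looks like, where `$NSLOTS` is the variable SGE sets to the number of granted slots:)

```shell
cat > ompi_job.sh <<'EOF'
#!/bin/sh
#$ -cwd
#$ -S /bin/bash
# Under tight integration mpirun picks up the SGE allocation itself;
# -np $NSLOTS just makes the slot count explicit.
mpirun -np $NSLOTS ./ompi_job
EOF
chmod +x ompi_job.sh
```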

The error I got looks like
=

[sgeqexec01:06612] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
file ../../../../../../orte/mca/ess/hnp/ess_hnp_module.c
at line 161
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--
[sgeqexec01:06612] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
file ../../../orte/runtime/orte_init.c
at line 132
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--
[sgeqexec01:06612] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
file ../../../../../orte/tools/orterun/orterun.c
at line 541

==

For troubleshooting I have done several things below:

1) Passwordless SSH has been configured properly between the execution hosts
and the queue master.

pwbcad@sgeqmast01:~$ ssh sgeqexec01 uptime
 14:35:54 up  2:47,  1 user,  load average: 0.10, 0.08, 0.02

2) I could run an OpenMPI job outside SGE successfully.

mpirun -host n1,n2 -np 8 ./ompi_job

3) I submitted the job to a queue directly instead of a PE; the job ran and
completed successfully:

qsub -q dev.q ./ompi_job.sh

4) Although I don't think PATH and LD_LIBRARY_PATH would cause issues on
Ubuntu, I still added the OpenMPI binaries and libraries to both. It didn't
help.

It would be much appreciated if anyone could share their experience!

Derrick