Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)

2011-04-18 Thread Reuti
Am 17.04.2011 um 01:21 schrieb Derrick LIN:

> 
> > Well, does `mpiexec` point to the correct one?
> 
> I don't really get this. I installed one and only one OpenMPI on the 
> node. There shouldn't be another 'mpiexec' on the system.

It could be one from any other MPI implementation by accident.


> It's worth mentioning that every node is deployed from a master image, so 
> everything is exactly the same except the IP and DNS name.
> > I thought you compiled it on your own with --with-sge. What about: 
> 
> pwbcad@sgeqexec01:~$ ompi_info | grep grid
>  MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.1)

Fine.


> Is there any location where I can find a more meaningful OpenMPI log?

Can you run a simple `mpiexec hostname` in the script?
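For example, a minimal test jobscript along these lines (just a sketch; the PE name 
"orte" is taken from your configuration, and the slot count is arbitrary):

#!/bin/sh
#$ -pe orte 2
#$ -cwd
#$ -j y
# which mpiexec is found, and can it start a trivial job under SGE?
which mpiexec
mpiexec hostname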


> I will try to install openmpi 1.4.3 and see if that works.
> 
> I want to confirm one more thing: does SGE's master host need to have OpenMPI 
> installed? Is it relevant?

In principle: no. But often it's installed too, as you will compile on either 
the master machine or a dedicated login server.

-- Reuti


> Many thanks Reuti
> 
> Derrick




Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)

2011-04-17 Thread Ralph Castain
I'm no SGE expert, but I do note that your original error indicates that mpirun 
was unable to find a launcher for your environment. When running under SGE, 
mpirun looks for certain environment variables indicative of SGE. If it finds 
those, it then looks for the "qrsh" command. If it doesn't find "qrsh" and/or 
it isn't executable by the user, then you will fail with that error.

Given that you have the envars, is "qrsh" in your path where mpirun is 
executing? If not, then that is the reason why you are able to run outside of 
SGE (where mpirun will default to using ssh) and not inside it.
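A quick way to check (just a sketch; the exact set of variables mpirun inspects can 
differ between Open MPI versions) is to add something like this to the top of your 
jobscript:

#!/bin/sh
# are the SGE-related variables set inside the job?
echo "SGE_ROOT=$SGE_ROOT ARC=$ARC JOB_ID=$JOB_ID PE_HOSTFILE=$PE_HOSTFILE"
# is qrsh visible in PATH, or at least present under $SGE_ROOT?
which qrsh || echo "qrsh not in PATH"
ls -l "$SGE_ROOT"/bin/*/qrsh 2>/dev/null || echo "no qrsh found under \$SGE_ROOT/bin"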


On Apr 16, 2011, at 5:21 PM, Derrick LIN wrote:

> 
> > Well, does `mpiexec` point to the correct one?
> 
> I don't really get this. I installed one and only one OpenMPI on the 
> node. There shouldn't be another 'mpiexec' on the system.
> 
> It's worth mentioning that every node is deployed from a master image, so 
> everything is exactly the same except the IP and DNS name.
> > I thought you compiled it on your own with --with-sge. What about: 
> 
> pwbcad@sgeqexec01:~$ ompi_info | grep grid
>  MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.1)
> 
> Is there any location where I can find a more meaningful OpenMPI log?
> 
> I will try to install openmpi 1.4.3 and see if that works.
> 
> I want to confirm one more thing: does SGE's master host need to have OpenMPI 
> installed? Is it relevant?
> 
> Many thanks Reuti
> 
> Derrick



Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)

2011-04-16 Thread Derrick LIN
> Well, does `mpiexec` point to the correct one?

I don't really get this. I installed one and only one OpenMPI on the
node. There shouldn't be another 'mpiexec' on the system.

It's worth mentioning that every node is deployed from a master image, so
everything is exactly the same except the IP and DNS name.

> I thought you compiled it on your own with --with-sge. What about:
pwbcad@sgeqexec01:~$ ompi_info | grep grid
 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.1)

Is there any location where I can find a more meaningful OpenMPI log?

I will try to install openmpi 1.4.3 and see if that works.

I want to confirm one more thing: does SGE's master host need to have
OpenMPI installed? Is it relevant?

Many thanks Reuti

Derrick


Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)

2011-04-16 Thread Reuti
Am 16.04.2011 um 23:09 schrieb Derrick LIN:

> So you route the SGE startup mechanism to use `ssh`; nevertheless it should 
> work, of course. A small difference to a conventional `ssh` is that SGE will 
> start a private daemon for each job on the nodes, listening on a random port.
> 
> When you use only one host, only forks will be created and no `ssh` call. 
> Does your test use more than one node?
> 
> I have tested with more than one node but the error still happened. 
> 
> Did you copy your SGE-aware version to all nodes at the same location? Are you 
> getting the correct `mpiexec` and shared libraries in your jobscript? Does 
> the output of:
> 
> I installed it via apt-get on each node, so OpenMPI is in the 
> standard location. In fact, Ubuntu handles all dependencies very well, without 
> my having to worry about PATH or LD_LIBRARY_PATH.

Well, does `mpiexec` point to the correct one? 

I thought you compiled it on your own with --with-sge. What about:

$ ompi_info | grep grid
 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.3)

Do you have this on all nodes, and was your binary compiled with this version?
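E.g., a quick check like this sketch (sgeqexec02 is just a placeholder; list your 
actual execution hosts):

#!/bin/sh
# sgeqexec02 is a placeholder - replace with your real node names
for h in sgeqexec01 sgeqexec02; do
  echo "== $h =="
  ssh $h 'which mpiexec; ompi_info | grep -i gridengine'
done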

All the stuff below looks fine.

You can even try to start "from scratch" with a private copy of Open MPI which 
you install for example in $HOME/local/openmpi-1.4.3 and set the paths 
accordingly.
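Roughly along these lines (a sketch; adjust the version and the prefix as you like):

$ tar xjf openmpi-1.4.3.tar.bz2   # tarball from the Open MPI download page
$ cd openmpi-1.4.3
$ ./configure --prefix=$HOME/local/openmpi-1.4.3 --with-sge
$ make && make install

and then, both in the shell you compile in and in the jobscript:

export PATH=$HOME/local/openmpi-1.4.3/bin:$PATH
export LD_LIBRARY_PATH=$HOME/local/openmpi-1.4.3/lib:$LD_LIBRARY_PATH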

-- Reuti


> #!/bin/sh
> which mpiexec
> echo $LD_LIBRARY_PATH
> ldd ompi_job
> 
> show the expected ones (ompi_job is the binary and ompi_job.sh the script) when 
> submitted with a PE request?
> 
> /usr/bin/mpiexec
> /usr/lib/openmpi/lib:/usr/lib/openmpi/lib/openmpi
> linux-vdso.so.1 =>  (0x7fff9b1ff000)
> libmpi.so.0 => /usr/lib/libmpi.so.0 (0x2af0868aa000)
> libopen-rte.so.0 => /usr/lib/libopen-rte.so.0 (0x2af086b58000)
> libopen-pal.so.0 => /usr/lib/libopen-pal.so.0 (0x2af086da4000)
> libdl.so.2 => /lib/libdl.so.2 (0x2af087017000)
> libnsl.so.1 => /lib/libnsl.so.1 (0x2af08721b000)
> libutil.so.1 => /lib/libutil.so.1 (0x2af087436000)
> libm.so.6 => /lib/libm.so.6 (0x2af087639000)
> libpthread.so.0 => /lib/libpthread.so.0 (0x2af0878bc000)
> libc.so.6 => /lib/libc.so.6 (0x2af087ada000)
> /lib64/ld-linux-x86-64.so.2 (0x2af086687000)
> 
> Below is some runtime data from the job's spool directory on the execution 
> host:
> 
> pwbcad@sgeqexec01:128.1$ ls
> addgrpid  config  environment  error  exit_status  job_pid  pe_hostfile  pid  
> trace  usage
> pwbcad@sgeqexec01:128.1$ cat config
> add_grp_id=65416
> fs_stdin_host=""
> fs_stdin_path=
> fs_stdin_tmp_path=/tmp/128.1.dev.q/
> fs_stdin_file_staging=0
> fs_stdout_host=""
> fs_stdout_path=
> fs_stdout_tmp_path=/tmp/128.1.dev.q/
> fs_stdout_file_staging=0
> fs_stderr_host=""
> fs_stderr_path=
> fs_stderr_tmp_path=/tmp/128.1.dev.q/
> fs_stderr_file_staging=0
> stdout_path=/mnt/FacilityBioinformatics/pwbcad
> stderr_path=/mnt/FacilityBioinformatics/pwbcad
> stdin_path=/dev/null
> merge_stderr=1
> tmpdir=/tmp/128.1.dev.q
> handle_as_binary=0
> no_shell=0
> ckpt_job=0
> h_vmem=INFINITY
> h_vmem_is_consumable_job=0
> s_vmem=INFINITY
> s_vmem_is_consumable_job=0
> h_cpu=INFINITY
> h_cpu_is_consumable_job=0
> s_cpu=INFINITY
> s_cpu_is_consumable_job=0
> h_stack=INFINITY
> h_stack_is_consumable_job=0
> s_stack=INFINITY
> s_stack_is_consumable_job=0
> h_data=INFINITY
> h_data_is_consumable_job=0
> s_data=INFINITY
> s_data_is_consumable_job=0
> h_core=INFINITY
> s_core=INFINITY
> h_rss=INFINITY
> s_rss=INFINITY
> h_fsize=INFINITY
> s_fsize=INFINITY
> s_descriptors=UNDEFINED
> h_descriptors=UNDEFINED
> s_maxproc=UNDEFINED
> h_maxproc=UNDEFINED
> s_memorylocked=UNDEFINED
> h_memorylocked=UNDEFINED
> s_locks=UNDEFINED
> h_locks=UNDEFINED
> priority=0
> shell_path=/bin/bash
> script_file=/var/spool/gridengine/execd/sgeqexec01/job_scripts/128
> job_owner=pwbcad
> min_gid=0
> min_uid=0
> cwd=/mnt/FacilityBioinformatics/pwbcad
> prolog=none
> epilog=none
> starter_method=NONE
> suspend_method=NONE
> resume_method=NONE
> terminate_method=NONE
> script_timeout=120
> pe=orte
> pe_slots=16
> host_slots=8
> pe_hostfile=/var/spool/gridengine/execd/sgeqexec01/active_jobs/128.1/pe_hostfile
> pe_start=/bin/true
> pe_stop=/bin/true
> pe_stdout_path=/mnt/FacilityBioinformatics/pwbcad
> pe_stderr_path=/mnt/FacilityBioinformatics/pwbcad
> shell_start_mode=posix_compliant
> use_login_shell=1
> mail_list=pwb...@enzo.garvan.unsw.edu.au
> mail_options=0
> forbid_reschedule=0
> forbid_apperror=0
> queue=dev.q
> host=sgeqexec01.garvan.unsw.edu.au
> processors=UNDEFINED
> binding=NULL
> job_name=run_cal_pi_auto
> job_id=128
> ja_task_id=0
> account=sge
> submission_time=1302987873
> notify=0
> acct_project=none
> njob_args=0
> queue_tmpdir=/tmp
> use_afs=0
> admin_user=sgeadmin
> notify_kill_type=1
> notify_kill=default
> notify_susp_type=1
> notify_susp=default
> qsub_gid=no
> pty=0
> write_osjob_id=1
> inherit_env=1
> enable_windomacc=0
> 

Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)

2011-04-16 Thread Derrick LIN
>
> So you route the SGE startup mechanism to use `ssh`; nevertheless it should
> work, of course. A small difference to a conventional `ssh` is that SGE will
> start a private daemon for each job on the nodes, listening on a random
> port.
>
> When you use only one host, only forks will be created and no `ssh` call.
> Does your test use more than one node?
>

I have tested with more than one node but the error still happened.

> Did you copy your SGE-aware version to all nodes at the same location? Are you
> getting the correct `mpiexec` and shared libraries in your jobscript? Does
> the output of:
>

I installed it via apt-get on each node, so OpenMPI is in the
standard location. In fact, Ubuntu handles all dependencies very well, without
my having to worry about PATH or LD_LIBRARY_PATH.


> #!/bin/sh
> which mpiexec
> echo $LD_LIBRARY_PATH
> ldd ompi_job
>
> show the expected ones (ompi_job is the binary and ompi_job.sh the script) when
> submitted with a PE request?
>

/usr/bin/mpiexec
/usr/lib/openmpi/lib:/usr/lib/openmpi/lib/openmpi
linux-vdso.so.1 =>  (0x7fff9b1ff000)
libmpi.so.0 => /usr/lib/libmpi.so.0 (0x2af0868aa000)
libopen-rte.so.0 => /usr/lib/libopen-rte.so.0 (0x2af086b58000)
libopen-pal.so.0 => /usr/lib/libopen-pal.so.0 (0x2af086da4000)
libdl.so.2 => /lib/libdl.so.2 (0x2af087017000)
libnsl.so.1 => /lib/libnsl.so.1 (0x2af08721b000)
libutil.so.1 => /lib/libutil.so.1 (0x2af087436000)
libm.so.6 => /lib/libm.so.6 (0x2af087639000)
libpthread.so.0 => /lib/libpthread.so.0 (0x2af0878bc000)
libc.so.6 => /lib/libc.so.6 (0x2af087ada000)
/lib64/ld-linux-x86-64.so.2 (0x2af086687000)

Below is some runtime data from the job's spool directory on the execution
host:

pwbcad@sgeqexec01:128.1$ ls
addgrpid  config  environment  error  exit_status  job_pid  pe_hostfile  pid
 trace  usage
pwbcad@sgeqexec01:128.1$ cat config
add_grp_id=65416
fs_stdin_host=""
fs_stdin_path=
fs_stdin_tmp_path=/tmp/128.1.dev.q/
fs_stdin_file_staging=0
fs_stdout_host=""
fs_stdout_path=
fs_stdout_tmp_path=/tmp/128.1.dev.q/
fs_stdout_file_staging=0
fs_stderr_host=""
fs_stderr_path=
fs_stderr_tmp_path=/tmp/128.1.dev.q/
fs_stderr_file_staging=0
stdout_path=/mnt/FacilityBioinformatics/pwbcad
stderr_path=/mnt/FacilityBioinformatics/pwbcad
stdin_path=/dev/null
merge_stderr=1
tmpdir=/tmp/128.1.dev.q
handle_as_binary=0
no_shell=0
ckpt_job=0
h_vmem=INFINITY
h_vmem_is_consumable_job=0
s_vmem=INFINITY
s_vmem_is_consumable_job=0
h_cpu=INFINITY
h_cpu_is_consumable_job=0
s_cpu=INFINITY
s_cpu_is_consumable_job=0
h_stack=INFINITY
h_stack_is_consumable_job=0
s_stack=INFINITY
s_stack_is_consumable_job=0
h_data=INFINITY
h_data_is_consumable_job=0
s_data=INFINITY
s_data_is_consumable_job=0
h_core=INFINITY
s_core=INFINITY
h_rss=INFINITY
s_rss=INFINITY
h_fsize=INFINITY
s_fsize=INFINITY
s_descriptors=UNDEFINED
h_descriptors=UNDEFINED
s_maxproc=UNDEFINED
h_maxproc=UNDEFINED
s_memorylocked=UNDEFINED
h_memorylocked=UNDEFINED
s_locks=UNDEFINED
h_locks=UNDEFINED
priority=0
shell_path=/bin/bash
script_file=/var/spool/gridengine/execd/sgeqexec01/job_scripts/128
job_owner=pwbcad
min_gid=0
min_uid=0
cwd=/mnt/FacilityBioinformatics/pwbcad
prolog=none
epilog=none
starter_method=NONE
suspend_method=NONE
resume_method=NONE
terminate_method=NONE
script_timeout=120
pe=orte
pe_slots=16
host_slots=8
pe_hostfile=/var/spool/gridengine/execd/sgeqexec01/active_jobs/128.1/pe_hostfile
pe_start=/bin/true
pe_stop=/bin/true
pe_stdout_path=/mnt/FacilityBioinformatics/pwbcad
pe_stderr_path=/mnt/FacilityBioinformatics/pwbcad
shell_start_mode=posix_compliant
use_login_shell=1
mail_list=pwb...@enzo.garvan.unsw.edu.au
mail_options=0
forbid_reschedule=0
forbid_apperror=0
queue=dev.q
host=sgeqexec01.garvan.unsw.edu.au
processors=UNDEFINED
binding=NULL
job_name=run_cal_pi_auto
job_id=128
ja_task_id=0
account=sge
submission_time=1302987873
notify=0
acct_project=none
njob_args=0
queue_tmpdir=/tmp
use_afs=0
admin_user=sgeadmin
notify_kill_type=1
notify_kill=default
notify_susp_type=1
notify_susp=default
qsub_gid=no
pty=0
write_osjob_id=1
inherit_env=1
enable_windomacc=0
enable_addgrp_kill=0
csp=0
ignore_fqdn=0
default_domain=none
pwbcad@sgeqexec01:128.1$ cat environment
USER=pwbcad
SSH_CLIENT=149.171.200.64 63056 22
MAIL=/var/mail/pwbcad
SHLVL=1
OLDPWD=/home/pwbcad
HOME=/home/pwbcad
SSH_TTY=/dev/pts/4
PAGER=less
PS1=\[\e[32;1m\]\u\[\e[0m\]@\[\e[35;1m\]\h\[\e[0m\]:\[\e[34;1m\]\W\[\e[0m\]\$
LOGNAME=pwbcad
_=/usr/bin/qsub
TERM=xterm
SGE_ROOT=/var/lib/gridengine
PATH=/tmp/128.1.dev.q:.:/home/pwbcad/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/meme/bin:/usr/local/eigenstrat:/usr/local/tophat/bin:/usr/local/cufflinks/bin:/usr/local/defuse/bin:/usr/local/bowtie/bin:/usr/local/cnvseq/bin:/usr/local/fastx_toolkit/bin:/usr/local/breakway/bin
SGE_CELL=default
LANG=en_AU.UTF-8
SHELL=/bin/bash
PWD=/mnt/FacilityBioinformatics/pwbcad

Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)

2011-04-15 Thread Reuti
Am 15.04.2011 um 23:02 schrieb Derrick LIN:

> - what is your SGE configuration `qconf -sconf`?
>  
> 
> rlogin_daemon    /usr/sbin/sshd -i
> rlogin_command   /usr/bin/ssh
> qlogin_daemon    /usr/sbin/sshd -i
> qlogin_command   /usr/share/gridengine/qlogin-wrapper
> rsh_daemon       /usr/sbin/sshd -i
> rsh_command      /usr/bin/ssh

So you route the SGE startup mechanism to use `ssh`; nevertheless it should 
work, of course. A small difference to a conventional `ssh` is that SGE will 
start a private daemon for each job on the nodes, listening on a random port.

When you use only one host, only forks will be created and no `ssh` call. Does 
your test use more than one node?

Did you copy your SGE-aware version to all nodes at the same location? Are you 
getting the correct `mpiexec` and shared libraries in your jobscript? Does the 
output of:

#!/bin/sh
which mpiexec
echo $LD_LIBRARY_PATH
ldd ompi_job

show the expected ones (ompi_job is the binary and ompi_job.sh the script) when 
submitted with a PE request?
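I.e., roughly like this (a sketch, using the "orte" PE and the file names from 
above; the slot count is just an example):

$ qsub -pe orte 8 ompi_job.sh

and inside ompi_job.sh, in addition to the checks above, something like:

# $NSLOTS is filled in by SGE with the number of granted slots
mpiexec -np $NSLOTS ./ompi_job

With a working tight integration, mpiexec should also pick up the SGE allocation 
when -np is omitted.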

-- Reuti


> jsv_url  none
> jsv_allowed_mod  ac,h,i,e,o,j,M,N,p,w
> 
> # my queue setting is:
> 
> qname               dev.q
> hostlist            sgeqexec01.domain.com.au
> seq_no              0
> load_thresholds     np_load_avg=1.75
> suspend_thresholds  NONE
> nsuspend            1
> suspend_interval    00:05:00
> priority            0
> min_cpu_interval    00:05:00
> processors          UNDEFINED
> qtype               BATCH INTERACTIVE
> ckpt_list           NONE
> pe_list             make orte
> rerun               FALSE
> slots               8
> tmpdir              /tmp
> shell               /bin/bash
> prolog              NONE
> epilog              NONE
> shell_start_mode    posix_compliant
> starter_method      NONE
> suspend_method      NONE
> resume_method       NONE
> terminate_method    NONE
> notify              00:00:60
> owner_list          NONE
> user_lists          NONE
> xuser_lists         NONE
> subordinate_list    NONE
> complex_values      NONE
> projects            NONE
> xprojects           NONE
> calendar            NONE
> initial_state       default
> s_rt                INFINITY
> h_rt                INFINITY
> s_cpu               INFINITY
> h_cpu               INFINITY
> s_fsize             INFINITY
> h_fsize             INFINITY
> s_data              INFINITY
> h_data              INFINITY
> s_stack             INFINITY
> h_stack             INFINITY
> s_core              INFINITY
> h_core              INFINITY
> s_rss               INFINITY
> h_rss               INFINITY
> s_vmem              INFINITY
> h_vmem              INFINITY
> 
> # my PE setting is:
> 
> pe_name             orte
> slots               4
> user_lists          NONE
> xuser_lists         NONE
> start_proc_args     /bin/true
> stop_proc_args      /bin/true
> allocation_rule     $round_robin
> control_slaves      TRUE
> job_is_first_task   FALSE
> urgency_slots       min
> accounting_summary  FALSE
>  
> a) you are testing from master to a node, but jobs are running between nodes.
> 
> b) unless you need X11 forwarding, using SGE's -builtin- communication works 
> fine; this way you can have a cluster without `rsh` or `ssh` (or with them 
> limited to admin staff) and can still run parallel jobs.
> 
> Sorry for the misleading snip. All the hosts (both the master and the execution 
> hosts) in the cluster can reach each other via password-less SSH without an issue. 
> As my point 2) states, I could successfully run a generic OpenMPI job outside of 
> SGE. So I do not think it is a communication issue?
>  
> Then you are bypassing SGE’s slot allocation and will have wrong accounting 
> and no job control of the slave tasks.
>  
> I know it's not a proper submission as a PE job. I simply ran out of ideas 
> about what to do next. Even though it's not the proper way, that OpenMPI error 
> didn't happen and the job completed. I am wondering why.
> 
> 
> The correct version of my OpenMPI is 1.4.1, not 1.3 as stated in my first post.
> 
> I later installed OpenMPI on the submission host and the master, but it 
> didn't help. So I guess OpenMPI is needed on the execution hosts only.




Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)

2011-04-15 Thread Derrick LIN
>
> - what is your SGE configuration `qconf -sconf`?


#global:
execd_spool_dir              /var/spool/gridengine/execd
mailer                       /usr/bin/mail
xterm                        /usr/bin/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 bash,sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           root
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false \
                             sharelog=00:00:00
finished_jobs                100
gid_range                    65400-65500
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
auto_user_oticket            0
auto_user_fshare             0
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 false
rlogin_daemon                /usr/sbin/sshd -i
rlogin_command               /usr/bin/ssh
qlogin_daemon                /usr/sbin/sshd -i
qlogin_command               /usr/share/gridengine/qlogin-wrapper
rsh_daemon                   /usr/sbin/sshd -i
rsh_command                  /usr/bin/ssh
jsv_url                      none
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w

# my queue setting is:

qname               dev.q
hostlist            sgeqexec01.domain.com.au
seq_no              0
load_thresholds     np_load_avg=1.75
suspend_thresholds  NONE
nsuspend            1
suspend_interval    00:05:00
priority            0
min_cpu_interval    00:05:00
processors          UNDEFINED
qtype               BATCH INTERACTIVE
ckpt_list           NONE
pe_list             make orte
rerun               FALSE
slots               8
tmpdir              /tmp
shell               /bin/bash
prolog              NONE
epilog              NONE
shell_start_mode    posix_compliant
starter_method      NONE
suspend_method      NONE
resume_method       NONE
terminate_method    NONE
notify              00:00:60
owner_list          NONE
user_lists          NONE
xuser_lists         NONE
subordinate_list    NONE
complex_values      NONE
projects            NONE
xprojects           NONE
calendar            NONE
initial_state       default
s_rt                INFINITY
h_rt                INFINITY
s_cpu               INFINITY
h_cpu               INFINITY
s_fsize             INFINITY
h_fsize             INFINITY
s_data              INFINITY
h_data              INFINITY
s_stack             INFINITY
h_stack             INFINITY
s_core              INFINITY
h_core              INFINITY
s_rss               INFINITY
h_rss               INFINITY
s_vmem              INFINITY
h_vmem              INFINITY

# my PE setting is:

pe_name             orte
slots               4
user_lists          NONE
xuser_lists         NONE
start_proc_args     /bin/true
stop_proc_args      /bin/true
allocation_rule     $round_robin
control_slaves      TRUE
job_is_first_task   FALSE
urgency_slots       min
accounting_summary  FALSE


> a) you are testing from master to a node, but jobs are running between
> nodes.


> b) unless you need X11 forwarding, using SGE's -builtin- communication
> works fine; this way you can have a cluster without `rsh` or `ssh` (or with
> them limited to admin staff) and can still run parallel jobs.
>

Sorry for the misleading snip. All the hosts (both the master and the execution
hosts) in the cluster can reach each other via password-less SSH without an issue.
As my point 2) states, I could successfully run a generic OpenMPI job outside of
SGE. So I do not think it is a communication issue?


> Then you are bypassing SGE’s slot allocation and will have wrong accounting
> and no job control of the slave tasks.
>

I know it's not a proper submission as a PE job. I simply ran out of ideas
about what to do next. Even though it's not the proper way, that OpenMPI error
didn't happen and the job completed. I am wondering why.


The correct version of my OpenMPI is 1.4.1, not 1.3 as stated in my first post.

I later installed OpenMPI on the submission host and the master, but it
didn't help. So I guess OpenMPI is needed on the execution hosts only.