Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)

2011-04-16 Thread Derrick LIN
> Well, does `mpiexec` point to the correct one?

I don't really get this. I installed one, and only one, Open MPI on the
node; there shouldn't be another `mpiexec` on the system.

It's worth mentioning that every node is deployed from a master image, so
everything is exactly the same except the IP and DNS name.

> I thought you compiled it on your own with --with-sge. What about:
pwbcad@sgeqexec01:~$ ompi_info | grep grid
 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.1)

Is there any location where I can find a more meaningful Open MPI log?
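
One thing I could try, assuming the standard MCA verbosity parameters work
on this build, is to make the failing frameworks log their component
selection (a sketch; the output should land on the job's stderr):

#!/bin/sh
# Sketch: raise verbosity of the plm, ess and ras frameworks, the
# ones named in the errors, while launching the job.
mpiexec --mca plm_base_verbose 10 \
        --mca ess_base_verbose 10 \
        --mca ras_base_verbose 10 ./ompi_job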

I will try to install Open MPI 1.4.3 and see if that works.

I want to confirm one more thing: does SGE's master host need to have
Open MPI installed? Is it relevant?

Many thanks, Reuti.

Derrick


Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)

2011-04-16 Thread Reuti
On 16.04.2011, at 23:09, Derrick LIN wrote:

>> So you route the SGE startup mechanism to use `ssh`; nevertheless, it should
>> work, of course. A small difference from a conventional `ssh` is that SGE will
>> start a private daemon for each job on the nodes, listening on a random port.
>>
>> When you use only one host, then forks will be created but no `ssh` call.
>> Your test uses more than one node?
> 
> I have tested with more than one node, but the error still happened.
> 
>> You copied your SGE-aware version to all nodes at the same location? Are you
>> getting the correct `mpiexec` and shared libraries in your jobscript? Does
>> the output of:
> 
> I installed it from the Ubuntu apt repository on each node, so Open MPI is in
> the standard location. In fact, Ubuntu handles all the dependencies well,
> without my having to worry about PATH or LD_LIBRARY_PATH.

Well, does `mpiexec` point to the correct one? 

I thought you compiled it on your own with --with-sge. What about:

$ ompi_info | grep grid
 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.3)

Do you have this on all nodes, and was your binary compiled with this version?
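
A quick check could be a small loop over the execution hosts (a sketch;
the hostnames are placeholders, adjust to your cluster):

#!/bin/sh
# Sketch: confirm every host reports the gridengine component and
# the same Open MPI version.
for h in sgeqexec01 sgeqexec02; do
    echo "== $h =="
    ssh "$h" 'ompi_info | grep -E "gridengine|Open MPI:"'
done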

Everything below looks fine.

You can even try to start "from scratch" with a private copy of Open MPI,
installed for example in $HOME/local/openmpi-1.4.3, with the paths set
accordingly.
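
Roughly, after unpacking the 1.4.3 tarball (a sketch; the prefix and
version are examples only):

#!/bin/sh
# Sketch: private, SGE-aware build under $HOME/local.
cd openmpi-1.4.3
./configure --prefix=$HOME/local/openmpi-1.4.3 --with-sge
make && make install
# Then put this copy first, in the jobscript and in your shell:
export PATH=$HOME/local/openmpi-1.4.3/bin:$PATH
export LD_LIBRARY_PATH=$HOME/local/openmpi-1.4.3/lib:$LD_LIBRARY_PATH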

-- Reuti


>> #!/bin/sh
>> which mpiexec
>> echo $LD_LIBRARY_PATH
>> ldd ompi_job
>>
>> show the expected ones (ompi_job is the binary and ompi_job.sh the script) when
>> submitted with a PE request?
> 
> /usr/bin/mpiexec
> /usr/lib/openmpi/lib:/usr/lib/openmpi/lib/openmpi
> linux-vdso.so.1 =>  (0x7fff9b1ff000)
> libmpi.so.0 => /usr/lib/libmpi.so.0 (0x2af0868aa000)
> libopen-rte.so.0 => /usr/lib/libopen-rte.so.0 (0x2af086b58000)
> libopen-pal.so.0 => /usr/lib/libopen-pal.so.0 (0x2af086da4000)
> libdl.so.2 => /lib/libdl.so.2 (0x2af087017000)
> libnsl.so.1 => /lib/libnsl.so.1 (0x2af08721b000)
> libutil.so.1 => /lib/libutil.so.1 (0x2af087436000)
> libm.so.6 => /lib/libm.so.6 (0x2af087639000)
> libpthread.so.0 => /lib/libpthread.so.0 (0x2af0878bc000)
> libc.so.6 => /lib/libc.so.6 (0x2af087ada000)
> /lib64/ld-linux-x86-64.so.2 (0x2af086687000)
> 
> Below is some runtime data from the job's spool directory on the execution
> host:
> 
> pwbcad@sgeqexec01:128.1$ ls
> addgrpid  config  environment  error  exit_status  job_pid  pe_hostfile  pid  trace  usage
> pwbcad@sgeqexec01:128.1$ cat config
> add_grp_id=65416
> fs_stdin_host=""
> fs_stdin_path=
> fs_stdin_tmp_path=/tmp/128.1.dev.q/
> fs_stdin_file_staging=0
> fs_stdout_host=""
> fs_stdout_path=
> fs_stdout_tmp_path=/tmp/128.1.dev.q/
> fs_stdout_file_staging=0
> fs_stderr_host=""
> fs_stderr_path=
> fs_stderr_tmp_path=/tmp/128.1.dev.q/
> fs_stderr_file_staging=0
> stdout_path=/mnt/FacilityBioinformatics/pwbcad
> stderr_path=/mnt/FacilityBioinformatics/pwbcad
> stdin_path=/dev/null
> merge_stderr=1
> tmpdir=/tmp/128.1.dev.q
> handle_as_binary=0
> no_shell=0
> ckpt_job=0
> h_vmem=INFINITY
> h_vmem_is_consumable_job=0
> s_vmem=INFINITY
> s_vmem_is_consumable_job=0
> h_cpu=INFINITY
> h_cpu_is_consumable_job=0
> s_cpu=INFINITY
> s_cpu_is_consumable_job=0
> h_stack=INFINITY
> h_stack_is_consumable_job=0
> s_stack=INFINITY
> s_stack_is_consumable_job=0
> h_data=INFINITY
> h_data_is_consumable_job=0
> s_data=INFINITY
> s_data_is_consumable_job=0
> h_core=INFINITY
> s_core=INFINITY
> h_rss=INFINITY
> s_rss=INFINITY
> h_fsize=INFINITY
> s_fsize=INFINITY
> s_descriptors=UNDEFINED
> h_descriptors=UNDEFINED
> s_maxproc=UNDEFINED
> h_maxproc=UNDEFINED
> s_memorylocked=UNDEFINED
> h_memorylocked=UNDEFINED
> s_locks=UNDEFINED
> h_locks=UNDEFINED
> priority=0
> shell_path=/bin/bash
> script_file=/var/spool/gridengine/execd/sgeqexec01/job_scripts/128
> job_owner=pwbcad
> min_gid=0
> min_uid=0
> cwd=/mnt/FacilityBioinformatics/pwbcad
> prolog=none
> epilog=none
> starter_method=NONE
> suspend_method=NONE
> resume_method=NONE
> terminate_method=NONE
> script_timeout=120
> pe=orte
> pe_slots=16
> host_slots=8
> pe_hostfile=/var/spool/gridengine/execd/sgeqexec01/active_jobs/128.1/pe_hostfile
> pe_start=/bin/true
> pe_stop=/bin/true
> pe_stdout_path=/mnt/FacilityBioinformatics/pwbcad
> pe_stderr_path=/mnt/FacilityBioinformatics/pwbcad
> shell_start_mode=posix_compliant
> use_login_shell=1
> mail_list=pwb...@enzo.garvan.unsw.edu.au
> mail_options=0
> forbid_reschedule=0
> forbid_apperror=0
> queue=dev.q
> host=sgeqexec01.garvan.unsw.edu.au
> processors=UNDEFINED
> binding=NULL
> job_name=run_cal_pi_auto
> job_id=128
> ja_task_id=0
> account=sge
> submission_time=1302987873
> notify=0
> acct_project=none
> njob_args=0
> queue_tmpdir=/tmp
> use_afs=0
> admin_user=sgeadmin
> notify_kill_type=1
> notify_kill=default
> notify_susp_type=1
> notify_susp=default
> qsub_gid=no
> pty=0
> write_osjob_id=1
> inherit_env=1
> enable_windomacc=0
> 

Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)

2011-04-16 Thread Derrick LIN
>
> So you route the SGE startup mechanism to use `ssh`; nevertheless, it
> should work, of course. A small difference from a conventional `ssh` is that
> SGE will start a private daemon for each job on the nodes, listening on a
> random port.
>
> When you use only one host, then forks will be created but no `ssh` call.
> Your test uses more than one node?
>

I have tested with more than one node, but the error still happened.
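
For reference, the ssh routing can be inspected in the global cluster
configuration; a sketch, assuming the usual tight-integration settings:

qconf -sconf | grep -E 'rsh_command|rsh_daemon'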

> You copied your SGE-aware version to all nodes at the same location? Are you
> getting the correct `mpiexec` and shared libraries in your jobscript? Does
> the output of:
>

I installed it from the Ubuntu apt repository on each node, so Open MPI is in
the standard location. In fact, Ubuntu handles all the dependencies well,
without my having to worry about PATH or LD_LIBRARY_PATH.
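
A sketch of how the packaged build can be checked for SGE support (if it
was not configured with --with-sge, the gridengine component would simply
be absent from the component list):

ompi_info | grep -i gridengine
dpkg -s openmpi-bin | grep -i '^version'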


> #!/bin/sh
> which mpiexec
> echo $LD_LIBRARY_PATH
> ldd ompi_job
>
> show the expected ones (ompi_job is the binary and ompi_job.sh the script) when
> submitted with a PE request?
>

/usr/bin/mpiexec
/usr/lib/openmpi/lib:/usr/lib/openmpi/lib/openmpi
linux-vdso.so.1 =>  (0x7fff9b1ff000)
libmpi.so.0 => /usr/lib/libmpi.so.0 (0x2af0868aa000)
libopen-rte.so.0 => /usr/lib/libopen-rte.so.0 (0x2af086b58000)
libopen-pal.so.0 => /usr/lib/libopen-pal.so.0 (0x2af086da4000)
libdl.so.2 => /lib/libdl.so.2 (0x2af087017000)
libnsl.so.1 => /lib/libnsl.so.1 (0x2af08721b000)
libutil.so.1 => /lib/libutil.so.1 (0x2af087436000)
libm.so.6 => /lib/libm.so.6 (0x2af087639000)
libpthread.so.0 => /lib/libpthread.so.0 (0x2af0878bc000)
libc.so.6 => /lib/libc.so.6 (0x2af087ada000)
/lib64/ld-linux-x86-64.so.2 (0x2af086687000)

Below is some runtime data from the job's spool directory on the execution
host:

pwbcad@sgeqexec01:128.1$ ls
addgrpid  config  environment  error  exit_status  job_pid  pe_hostfile  pid  trace  usage
pwbcad@sgeqexec01:128.1$ cat config
add_grp_id=65416
fs_stdin_host=""
fs_stdin_path=
fs_stdin_tmp_path=/tmp/128.1.dev.q/
fs_stdin_file_staging=0
fs_stdout_host=""
fs_stdout_path=
fs_stdout_tmp_path=/tmp/128.1.dev.q/
fs_stdout_file_staging=0
fs_stderr_host=""
fs_stderr_path=
fs_stderr_tmp_path=/tmp/128.1.dev.q/
fs_stderr_file_staging=0
stdout_path=/mnt/FacilityBioinformatics/pwbcad
stderr_path=/mnt/FacilityBioinformatics/pwbcad
stdin_path=/dev/null
merge_stderr=1
tmpdir=/tmp/128.1.dev.q
handle_as_binary=0
no_shell=0
ckpt_job=0
h_vmem=INFINITY
h_vmem_is_consumable_job=0
s_vmem=INFINITY
s_vmem_is_consumable_job=0
h_cpu=INFINITY
h_cpu_is_consumable_job=0
s_cpu=INFINITY
s_cpu_is_consumable_job=0
h_stack=INFINITY
h_stack_is_consumable_job=0
s_stack=INFINITY
s_stack_is_consumable_job=0
h_data=INFINITY
h_data_is_consumable_job=0
s_data=INFINITY
s_data_is_consumable_job=0
h_core=INFINITY
s_core=INFINITY
h_rss=INFINITY
s_rss=INFINITY
h_fsize=INFINITY
s_fsize=INFINITY
s_descriptors=UNDEFINED
h_descriptors=UNDEFINED
s_maxproc=UNDEFINED
h_maxproc=UNDEFINED
s_memorylocked=UNDEFINED
h_memorylocked=UNDEFINED
s_locks=UNDEFINED
h_locks=UNDEFINED
priority=0
shell_path=/bin/bash
script_file=/var/spool/gridengine/execd/sgeqexec01/job_scripts/128
job_owner=pwbcad
min_gid=0
min_uid=0
cwd=/mnt/FacilityBioinformatics/pwbcad
prolog=none
epilog=none
starter_method=NONE
suspend_method=NONE
resume_method=NONE
terminate_method=NONE
script_timeout=120
pe=orte
pe_slots=16
host_slots=8
pe_hostfile=/var/spool/gridengine/execd/sgeqexec01/active_jobs/128.1/pe_hostfile
pe_start=/bin/true
pe_stop=/bin/true
pe_stdout_path=/mnt/FacilityBioinformatics/pwbcad
pe_stderr_path=/mnt/FacilityBioinformatics/pwbcad
shell_start_mode=posix_compliant
use_login_shell=1
mail_list=pwb...@enzo.garvan.unsw.edu.au
mail_options=0
forbid_reschedule=0
forbid_apperror=0
queue=dev.q
host=sgeqexec01.garvan.unsw.edu.au
processors=UNDEFINED
binding=NULL
job_name=run_cal_pi_auto
job_id=128
ja_task_id=0
account=sge
submission_time=1302987873
notify=0
acct_project=none
njob_args=0
queue_tmpdir=/tmp
use_afs=0
admin_user=sgeadmin
notify_kill_type=1
notify_kill=default
notify_susp_type=1
notify_susp=default
qsub_gid=no
pty=0
write_osjob_id=1
inherit_env=1
enable_windomacc=0
enable_addgrp_kill=0
csp=0
ignore_fqdn=0
default_domain=none
pwbcad@sgeqexec01:128.1$ cat environment
USER=pwbcad
SSH_CLIENT=149.171.200.64 63056 22
MAIL=/var/mail/pwbcad
SHLVL=1
OLDPWD=/home/pwbcad
HOME=/home/pwbcad
SSH_TTY=/dev/pts/4
PAGER=less
PS1=\[\e[32;1m\]\u\[\e[0m\]@\[\e[35;1m\]\h\[\e[0m\]:\[\e[34;1m\]\W\[\e[0m\]\$
LOGNAME=pwbcad
_=/usr/bin/qsub
TERM=xterm
SGE_ROOT=/var/lib/gridengine
PATH=/tmp/128.1.dev.q:.:/home/pwbcad/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/meme/bin:/usr/local/eigenstrat:/usr/local/tophat/bin:/usr/local/cufflinks/bin:/usr/local/defuse/bin:/usr/local/bowtie/bin:/usr/local/cnvseq/bin:/usr/local/fastx_toolkit/bin:/usr/local/breakway/bin
SGE_CELL=default
LANG=en_AU.UTF-8
SHELL=/bin/bash
PWD=/mnt/FacilityBioinformatics/pwbcad

Re: [OMPI users] missing symbols in Windows 1.5.3 binaries?

2011-04-16 Thread Damien

Shiqing,

I'm using Composer XE 2011 and Visual Studio 2008.  VS2008 is doing the
linking.  I'll do a build of 1.5.3 myself and see how the symbols turn out.
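
For reference, a sketch of the symbol check from a VS command prompt, run
against each version's libmpi.lib:

dumpbin /symbols libmpi.lib | findstr NULL_COPY_FN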


Damien

On 16/04/2011 1:50 PM, Shiqing Fan wrote:

Hi Damien,

Which version of the Intel compiler do you use? The only difference between
1.5.3 and 1.5.2 I can tell is that 1.5.3 was built with Intel Fortran
Composer XE 2011. That might be the reason for the problem.


Shiqing

On 4/16/2011 4:01 AM, Damien wrote:

Hiya,

I just tested the 1.5.3 binaries and my link pass broke.  Using 1.5.3
I get unresolved externals on things like _MPI_NULL_COPY_FN.  On
1.5.2.2 it's fine.  I did a dumpbin on libmpi.lib for both versions,
and in 1.5.3 there are upper-case symbols for _OMPI_C_MPI_NULL_COPY_FN,
but not _MPI_NULL_COPY_FN.  In the 1.5.2.2 libmpi.lib there are symbols
for both.


Damien

Re: [OMPI users] missing symbols in Windows 1.5.3 binaries?

2011-04-16 Thread Shiqing Fan

Hi Damien,

Which version of the Intel compiler do you use? The only difference between
1.5.3 and 1.5.2 I can tell is that 1.5.3 was built with Intel Fortran
Composer XE 2011. That might be the reason for the problem.


Shiqing

On 4/16/2011 4:01 AM, Damien wrote:

Hiya,

I just tested the 1.5.3 binaries and my link pass broke.  Using 1.5.3
I get unresolved externals on things like _MPI_NULL_COPY_FN.  On
1.5.2.2 it's fine.  I did a dumpbin on libmpi.lib for both versions,
and in 1.5.3 there are upper-case symbols for _OMPI_C_MPI_NULL_COPY_FN,
but not _MPI_NULL_COPY_FN.  In the 1.5.2.2 libmpi.lib there are symbols
for both.


Damien




--
Shiqing Fan
High Performance Computing Center Stuttgart (HLRS)
Tel: ++49(0)711-685-87234  Nobelstrasse 19
Fax: ++49(0)711-685-65832  70569 Stuttgart
http://www.hlrs.de/organization/people/shiqing-fan/
email: f...@hlrs.de



[OMPI users] Ofed v1.5.3?

2011-04-16 Thread Michael Di Domenico
Does Open MPI v1.5.3 support OFED v1.5.3.1?