Oddly enough, adding this debugging info lowered the number of usable
processes from 46 to 42. When I run the MPI job, it fails, giving only the
following output:
[root@compute-2-1 ssh]# /home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -v -np 44 --leave-session-attached -mca
odls_base_verbose 5 hostname
[compute-2-1.local:44374] mca:base:select:( odls) Querying component
[default]
[compute-2-1.local:44374] mca:base:select:( odls) Query of component
[default] set priority to 1
[compute-2-1.local:44374] mca:base:select:( odls) Selected component
[default]
[compute-2-0.local:28950] mca:base:select:( odls) Querying component
[default]
[compute-2-0.local:28950] mca:base:select:( odls) Query of component
[default] set priority to 1
[compute-2-0.local:28950] mca:base:select:( odls) Selected component
[default]
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
On 12/14/2012 03:18 PM, Ralph Castain wrote:
It wouldn't be ssh - in both cases, only one ssh connection is made to each
node (to start the local daemon). The only difference is the number of
fork/execs being done on each node, and the number of file descriptors being
opened to support those fork/execs.
It certainly looks like your limits are high enough. When you say it "fails",
what do you mean - what error does it report? Try adding:
--leave-session-attached -mca odls_base_verbose 5
to your command line - this will report all of the local process-launch debug
output and hopefully show you a more detailed error report.
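The file-descriptor pressure described above can be gauged directly on Linux by counting a process's entries under /proc/<pid>/fd. A minimal sketch (not from the thread), using the current shell as a stand-in for the launch daemon:

```shell
# Count the open file descriptors of the current shell via /proc (Linux-only).
# To gauge the daemon's usage instead, substitute the PID of mpirun's orted
# process for $$ while ranks are being launched.
fd_count=$(ls /proc/$$/fd | wc -l)
echo "open descriptors: $fd_count"
```

Comparing that count against `ulimit -n` shows how close a process gets to the open-files limit.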
On Dec 14, 2012, at 12:29 PM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:
I have had to cobble together two machines in our Rocks cluster without using
the standard installation; they have an EFI-only BIOS, which Rocks doesn't
like, so this was the only workaround.
Everything works great now, except for one thing: MPI jobs (Open MPI or
MPICH) fail when started from one of these nodes (via qsub, or by logging in
and running the command) if 24 or more processors are needed on another
system. However, if the originator of the MPI job is the head node or any of
the preexisting compute nodes, it works fine. Right now I am guessing ssh
client or ulimit problems, but I cannot find any difference between the
nodes. Any help would be greatly appreciated.
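One way to chase the ulimit guess (a sketch, assuming some limit differs between the new and preexisting nodes) is to print the two limits that most commonly bound a large fork/exec fan-out and compare the output between a node that works and one that fails:

```shell
# Print this node's per-user process and open-file limits; run the same
# snippet on a working node and a failing node and compare the output.
nproc_limit=$(ulimit -u)    # max user processes
nofile_limit=$(ulimit -n)   # max open file descriptors
echo "$(hostname): nproc=$nproc_limit nofile=$nofile_limit"
```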
compute-2-1 and compute-2-0 are the new nodes.
Examples:
This works and prints 23 hostnames from each machine:
[root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -np 46 hostname
This does not work; it prints 24 hostnames for compute-2-1:
[root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -np 48 hostname
These both work and print 64 hostnames from each node:
[root@biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -np 128 hostname
[root@compute-0-2 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -np 128 hostname
[root@compute-2-1 ~]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 16410016
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 4096
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
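If the max user processes value of 1024 shown above ever did turn out to be the bottleneck (the reply above suggests the limits look high enough), the usual place to raise it on a RHEL/Rocks-style node is /etc/security/limits.conf. Illustrative entries only - the values are guesses, not a confirmed fix:

```
*  soft  nproc   16384
*  hard  nproc   16384
*  soft  nofile  16384
*  hard  nofile  16384
```

A fresh login session is needed before ulimit -a reflects any change.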
[root@compute-2-1 ~]# more /etc/ssh/ssh_config
Host *
CheckHostIP no
ForwardX11 yes
ForwardAgent yes
StrictHostKeyChecking no
UsePrivilegedPort no
Protocol 2,1
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users