Re: [OMPI users] Ompi failing on mx only

2007-01-09 Thread Grobe, Gary L. (JSC-EV)[ESCG]
> What I need is the backtrace on the process which generates the
> segfault. Second, in order to understand the backtrace, it's
> better to have run a debug version of Open MPI. Without the
> debug version we only see the address where the fault occurs,
> without having access to the line number ...

How about this: this is the section I was stepping through in order
to get to the first error I usually run into ... "mx_connect fail for
node-1:0 with key  (error Endpoint closed or not connectable!)"

// gdb output

Breakpoint 1, 0x2ac856bd92e0 in opal_progress ()
   from /usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0
(gdb) s
Single stepping until exit from function opal_progress, 
which has no line number information.
0x2ac857361540 in sched_yield () from /lib/libc.so.6
(gdb) s
Single stepping until exit from function sched_yield, 
which has no line number information.
opal_condition_wait (c=0x5098e0, m=0x5098a0) at condition.h:80
80  while (c->c_signaled == 0) {
(gdb) s
81  opal_progress();
(gdb) s

Breakpoint 1, 0x2ac856bd92e0 in opal_progress ()
   from /usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0
(gdb) s
Single stepping until exit from function opal_progress, 
which has no line number information.
0x2ac857361540 in sched_yield () from /lib/libc.so.6
(gdb) backtrace
#0  0x2ac857361540 in sched_yield () from /lib/libc.so.6
#1  0x00402f60 in opal_condition_wait (c=0x5098e0, m=0x5098a0)
at condition.h:81
#2  0x00402b3c in orterun (argc=17, argv=0x7fff54151088)
at orterun.c:427
#3  0x00402713 in main (argc=17, argv=0x7fff54151088) at
main.c:13
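
For what it's worth, this backtrace looks like mpirun simply idling in
its wait loop rather than the fault itself: opal_condition_wait() spins
on the condition and drives opal_progress(), which ends up in
sched_yield(). A self-contained paraphrase of the loop shown at
condition.h:80-81, with the Open MPI types stubbed out by me, would be:

/* Stub paraphrase of the wait loop at condition.h:80-81 above; the
 * real Open MPI types are replaced with stand-ins.  mpirun sits in a
 * loop of this shape, repeatedly driving the progress engine (which
 * may call sched_yield()) until the condition is signaled, hence the
 * sched_yield() frames in the backtrace. */
#include <sched.h>

struct stub_condition { volatile int c_signaled; };

static void stub_progress(void)      /* stand-in for opal_progress() */
{
    sched_yield();
}

static void stub_condition_wait(struct stub_condition *c)
{
    while (c->c_signaled == 0) {     /* condition.h:80 */
        stub_progress();             /* condition.h:81 */
    }
}

That would be consistent with the mx_connect failure at the end of the
output below coming from one of the cpi processes (pid 10720 in the
proctable) rather than from mpirun itself.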

--- This is the mpirun output as I was stepping through it. At the end
of this is the error that the backtrace above shows.

[node-2:11909] top: openmpi-sessions-ggrobe@node-2_0
[node-2:11909] tmp: /tmp
[node-1:10719] procdir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1/0
[node-1:10719] jobdir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1
[node-1:10719] unidir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414
[node-1:10719] top: openmpi-sessions-ggrobe@node-1_0
[node-1:10719] tmp: /tmp
[juggernaut:17414] spawn: in job_state_callback(jobid = 1, state = 0x4)
[juggernaut:17414] Info: Setting up debugger process table for
applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 6
  MPIR_proctable:
(i, host, exe, pid) = (0, node-1,
/home/ggrobe/Projects/ompi/cpi/./cpi, 10719)
(i, host, exe, pid) = (1, node-1,
/home/ggrobe/Projects/ompi/cpi/./cpi, 10720)
(i, host, exe, pid) = (2, node-1,
/home/ggrobe/Projects/ompi/cpi/./cpi, 10721)
(i, host, exe, pid) = (3, node-1,
/home/ggrobe/Projects/ompi/cpi/./cpi, 10722)
(i, host, exe, pid) = (4, node-2,
/home/ggrobe/Projects/ompi/cpi/./cpi, 11908)
(i, host, exe, pid) = (5, node-2,
/home/ggrobe/Projects/ompi/cpi/./cpi, 11909)
[node-1:10718] sess_dir_finalize: proc session dir not empty - leaving
[node-1:10718] sess_dir_finalize: proc session dir not empty - leaving
[node-1:10721] procdir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1/2
[node-1:10721] jobdir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1
[node-1:10721] unidir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414
[node-1:10721] top: openmpi-sessions-ggrobe@node-1_0
[node-1:10721] tmp: /tmp
[node-1:10720] mx_connect fail for node-1:0 with key  (error
Endpoint closed or not connectable!)



Re: [OMPI users] Ompi failing on mx only

2007-01-08 Thread Grobe, Gary L. (JSC-EV)[ESCG]
> >> PS: Is there any way you can attach to the processes with gdb? I
> >> would like to see the backtrace as shown by gdb in order to be able
> >> to figure out what's wrong there.
> >

I found out that all processes on the 2nd node crash, so I just put a
30-second wait before MPI_Init in order to attach gdb and go from there.

The code in cpi starts off as follows (in order to show where the
SIGTERM below is coming from).

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
MPI_Get_processor_name(processor_name, &namelen);
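
The "30 second wait" mentioned above amounts to a sleep placed just
ahead of MPI_Init; a minimal sketch of such a hook (the environment
variable name, message, and delay are my own choices, not the actual
cpi source):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    /* Hypothetical debug hook: pause before MPI_Init so gdb can be
     * attached on the remote node; the environment variable and the
     * 30s delay are arbitrary choices, not part of the real cpi. */
    if (getenv("CPI_WAIT_FOR_GDB") != NULL) {
        fprintf(stderr, "pid %d waiting 30s for gdb attach\n",
                (int) getpid());
        sleep(30);
    }

    MPI_Init(&argc, &argv);
    /* ... the rest of cpi as quoted above ... */
    MPI_Finalize();
    return 0;
}

During the sleep you can attach from the node (e.g. "gdb ./cpi <pid>"),
set breakpoints, and continue, which is what the gdb session below
shows.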

---

Attaching to process 11856
Reading symbols from /home/ggrobe/Projects/ompi/cpi/cpi...done.
Using host libthread_db library "/lib/libthread_db.so.1".
Reading symbols from
/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0...done.
Loaded symbols for /usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0
Reading symbols from
/usr/local/openmpi-1.2b3r13030/lib/libopen-rte.so.0...done.
Loaded symbols for /usr/local/openmpi-1.2b3r13030/lib/libopen-rte.so.0
Reading symbols from
/usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0...done.
Loaded symbols for /usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib64/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib64/libutil.so.1...done.
Loaded symbols for /lib/libutil.so.1
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib64/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 46974166086512 (LWP 11856)]
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x2ab90661e880 in nanosleep () from /lib/libc.so.6
(gdb) break MPI_Init
Breakpoint 1 at 0x2ab905c0c880
(gdb) break MPI_Comm_size
Breakpoint 2 at 0x2ab905c01af0
(gdb) continue
Continuing.
[Switching to Thread 46974166086512 (LWP 11856)]

Breakpoint 1, 0x2ab905c0c880 in PMPI_Init ()
   from /usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0
(gdb) n
Single stepping until exit from function PMPI_Init, 
which has no line number information.
[New Thread 1082132816 (LWP 11862)]

Program received signal SIGTERM, Terminated.
0x2ab906643f47 in ioctl () from /lib/libc.so.6
(gdb) backtrace
#0  0x2ab906643f47 in ioctl () from /lib/libc.so.6
Cannot access memory at address 0x7fffa50102f8
---

Does this help in any way?



Re: [OMPI users] Ompi failing on mx only

2007-01-08 Thread Grobe, Gary L. (JSC-EV)[ESCG]
> >> PS: Is there any way you can attach to the processes with gdb? I
> >> would like to see the backtrace as shown by gdb in order to be able
> >> to figure out what's wrong there.
> >
> > When I can get more detailed debug output, I'll send it. Though I'm
> > not clear on what executable is being searched for below.
> >
> > $ mpirun -dbg=gdb --prefix /usr/local/openmpi-1.2b3r13030 -x
> > LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5
> > --mca pml cm --mca mtl mx ./cpi
> 
> FWIW, note that "-dbg" is not a recognized Open MPI mpirun 
> command line switch -- after all the debugging information, 
> Open MPI finally gets to telling you:
> 

Sorry, wrong MPI, ok ... FWIW, here's a crash reproduced with just the
-d option. The problem I'm trying to get to right now is how to debug
the 2nd process on the 2nd node, since that's where the crash is always
happening. One process past the 1st node works fine (5 procs with 4 per
node), but when a second process on the 2nd node starts, or anything
more than that, the crashes will occur.
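
One way to catch exactly those processes is to gate the pre-MPI_Init
wait described earlier on the hostname, so that only the ranks landing
on the 2nd node pause for the debugger. A sketch (the helper name, host
name, and delay are placeholders of mine, not part of cpi):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: pause only on the host where the crash shows up
 * (node-2 here) so gdb can be attached to those processes before
 * MPI_Init runs.  Host name, delay, and function name are placeholders. */
static void wait_for_debugger_on(const char *target_host)
{
    char host[256] = "";

    gethostname(host, sizeof(host) - 1);
    if (strcmp(host, target_host) == 0) {
        fprintf(stderr, "pid %d on %s: waiting 30s for gdb attach\n",
                (int) getpid(), host);
        sleep(30);
    }
}

Calling wait_for_debugger_on("node-2") as the first statement in main()
leaves the other ranks untouched. The -d run itself follows.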

$ mpirun -d --prefix /usr/local/openmpi-1.2b3r13030 -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 6 --mca pml cm
--mca mtl mx ./cpi > dbg.out 2>&1

[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] [0,0,0] setting up session dir with
[juggernaut:15087]  universe default-universe-15087
[juggernaut:15087]  user ggrobe
[juggernaut:15087]  host juggernaut
[juggernaut:15087]  jobid 0
[juggernaut:15087]  procid 0
[juggernaut:15087] procdir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-15087/0/0
[juggernaut:15087] jobdir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-15087/0
[juggernaut:15087] unidir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-15087
[juggernaut:15087] top: openmpi-sessions-ggrobe@juggernaut_0
[juggernaut:15087] tmp: /tmp
[juggernaut:15087] [0,0,0] contact_file
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-15087/univers
e-setup.txt
[juggernaut:15087] [0,0,0] wrote setup file
[juggernaut:15087] pls:rsh: local csh: 0, local sh: 1
[juggernaut:15087] pls:rsh: assuming same remote shell as local shell
[juggernaut:15087] pls:rsh: remote csh: 0, remote sh: 1
[juggernaut:15087] pls:rsh: final template argv:
[juggernaut:15087] pls:rsh: /usr/bin/ssh  orted --debug
--bootproxy 1 --name  --num_procs 3 --vpid_start 0 --nodename
 --universe ggrobe@juggernaut:default-universe-15087
--nsreplica "0.0.0;tcp://192.168.2.10:52099" --gprreplica
"0.0.0;tcp://192.168.2.10:52099"
[juggernaut:15087] pls:rsh: launching on node node-1
[juggernaut:15087] pls:rsh: node-1 is a REMOTE node
[juggernaut:15087] pls:rsh: executing: /usr/bin/ssh node-1
PATH=/usr/local/openmpi-1.2b3r13030/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/usr/local/openmpi-1.2b3r13030/lib:$LD_LIBRARY_PATH ;
export LD_LIBRARY_PATH ; /usr/local/openmpi-1.2b3r13030/bin/orted
--debug --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0
--nodename node-1 --universe ggrobe@juggernaut:default-universe-15087
--nsreplica "0.0.0;tcp://192.168.2.10:52099" --gprreplica
"0.0.0;tcp://192.168.2.10:52099"
[juggernaut:15087] pls:rsh: launching on node node-2
[juggernaut:15087] pls:rsh: node-2 is a REMOTE node
[juggernaut:15087] pls:rsh: executing: /usr/bin/ssh node-2
PATH=/usr/local/openmpi-1.2b3r13030/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/usr/local/openmpi-1.2b3r13030/lib:$LD_LIBRARY_PATH ;
export LD_LIBRARY_PATH ; /usr/local/openmpi-1.2b3r13030/bin/orted
--debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0
--nodename node-2 --universe ggrobe@juggernaut:default-universe-15087
--nsreplica "0.0.0;tcp://192.168.2.10:52099" --gprreplica
"0.0.0;tcp://192.168.2.10:52099"
[node-2:11499] [0,0,2] setting up session dir with
[node-2:11499]  universe default-universe-15087
[node-2:11499]  user ggrobe
[node-2:11499]  host node-2
[node-2:11499]  jobid 0
[node-2:11499]  procid 2
[node-1:10307] procdir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087/0/1
[node-1:10307] jobdir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087/0
[node-1:10307] unidir:
/tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087
[node-1:10307] top: openmpi-sessions-ggrobe@node-1_0
[node-2:11499] procdir:
/tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087/0/2
[node-2:11499] jobdir:
/tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087/0
[node-2:11499] unidir:
/tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087
[node-2:11499] top: openmpi-sessions-ggrobe@node-2_0
[node-2:11499] tmp: 

Re: [OMPI users] Ompi failing on mx only

2007-01-08 Thread Grobe, Gary L. (JSC-EV)[ESCG]
I was wondering if someone could send me the HACKING file so I can do a
bit more with debugging on the snapshots. Our web proxy has the webdav
methods turned off (those request methods fail), so I can't get to the
latest of the svn repos.

> Second thing. From one of your previous emails, I see that MX
> is configured with 4 instances per node. You're running with
> exactly 4 processes on the first 2 nodes. Weird things might
> happen ...

Just curious about this comment. Are you referring to oversubscribing?
We run 4 processes on each node because we have 2 dual-core CPUs on
each node. Am I not understanding processor counts correctly?

> PS: Is there any way you can attach to the processes with gdb?
> I would like to see the backtrace as shown by gdb in order
> to be able to figure out what's wrong there.

When I can get more detailed debug output, I'll send it. Though I'm not
clear on what executable is being searched for below.

$ mpirun -dbg=gdb --prefix /usr/local/openmpi-1.2b3r13030 -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca pml cm
--mca mtl mx ./cpi

[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] connect_uni: connection not allowed
[juggernaut:14949] [0,0,0] setting up session dir with
[juggernaut:14949]  universe default-universe-14949
[juggernaut:14949]  user ggrobe
[juggernaut:14949]  host juggernaut
[juggernaut:14949]  jobid 0
[juggernaut:14949]  procid 0
[juggernaut:14949] procdir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949/0/0
[juggernaut:14949] jobdir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949/0
[juggernaut:14949] unidir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949
[juggernaut:14949] top: openmpi-sessions-ggrobe@juggernaut_0
[juggernaut:14949] tmp: /tmp
[juggernaut:14949] [0,0,0] contact_file
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949/univers
e-setup.txt
[juggernaut:14949] [0,0,0] wrote setup file
[juggernaut:14949] pls:rsh: local csh: 0, local sh: 1
[juggernaut:14949] pls:rsh: assuming same remote shell as local shell
[juggernaut:14949] pls:rsh: remote csh: 0, remote sh: 1
[juggernaut:14949] pls:rsh: final template argv:
[juggernaut:14949] pls:rsh: /usr/bin/ssh  orted --debug
--bootproxy 1 --name  --num_procs 2 --vpid_start 0 --nodename
 --universe ggrobe@juggernaut:default-universe-14949
--nsreplica "0.0.0;tcp://192.168.2.10:43121" --gprreplica
"0.0.0;tcp://192.168.2.10:43121"
[juggernaut:14949] pls:rsh: launching on node juggernaut
[juggernaut:14949] pls:rsh: juggernaut is a LOCAL node
[juggernaut:14949] pls:rsh: changing to directory /home/ggrobe
[juggernaut:14949] pls:rsh: executing: orted --debug --bootproxy 1
--name 0.0.1 --num_procs 2 --vpid_start 0 --nodename juggernaut
--universe ggrobe@juggernaut:default-universe-14949 --nsreplica
"0.0.0;tcp://192.168.2.10:43121" --gprreplica
"0.0.0;tcp://192.168.2.10:43121"
[juggernaut:14950] [0,0,1] setting up session dir with
[juggernaut:14950]  universe default-universe-14949
[juggernaut:14950]  user ggrobe
[juggernaut:14950]  host juggernaut
[juggernaut:14950]  jobid 0
[juggernaut:14950]  procid 1
[juggernaut:14950] procdir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949/0/1
[juggernaut:14950] jobdir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949/0
[juggernaut:14950] unidir:
/tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-14949
[juggernaut:14950] top: openmpi-sessions-ggrobe@juggernaut_0
[juggernaut:14950] tmp: /tmp

--
Failed to find the following executable:

Host:   juggernaut
Executable: -b

Cannot continue.

--
[juggernaut:14950] [0,0,1] ORTE_ERROR_LOG: Fatal in file
odls_default_module.c at line 1193
[juggernaut:14949] spawn: in job_state_callback(jobid = 1, state = 0x80)
[juggernaut:14950] [0,0,1] ORTE_ERROR_LOG: Fatal in file orted.c at line
575
[juggernaut:14950] sess_dir_finalize: job session dir not empty -
leaving
[juggernaut:14950] sess_dir_finalize: proc session dir not empty -
leaving
[juggernaut:14949] sess_dir_finalize: proc session dir not empty -
leaving






Re: [OMPI users] Ompi failing on mx only

2007-01-05 Thread Grobe, Gary L. (JSC-EV)[ESCG]
Ok, sorry about that last one. I think someone just bumped up the
required version of Automake.

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Grobe, Gary L. (JSC-EV)[ESCG]
Sent: Friday, January 05, 2007 2:29 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

This is just an FYI on the Jan 5th snapshot.

I'll send a backtrace of the processes as soon as I get a b3 running.
Between my filtered webdav svn access problems and the latest nightly
snapshots, my builds are currently failing where the same configure
line worked on previous snapshots ...

$ ./configure --prefix=/usr/local/openmpi-1.2b3r13006 --with-mx=/opt/mx
--with-mx-libdir=/opt/mx/lib ...

*** GNU libltdl setup
configure: OMPI configuring in opal/libltdl
configure: running /bin/sh './configure'
'--prefix=/usr/local/openmpi-1.2b3r13006' '--with-mx=/opt/mx'
'--with-mx-libdir=/opt/mx/lib' 'F77=ifort' --enable-ltdl-convenience
--disable-ltdl-install --enable-shared --disable-static
--cache-file=/dev/null --srcdir=.
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... configure: error: newly
created file is older than distributed files!
Check your system clock
configure: /bin/sh './configure' *failed* for opal/libltdl
configure: error: Failed to build GNU libltdl.  This usually means that
something is incorrectly setup with your environment.  There may be
useful information in opal/libltdl/config.log.  You can also disable GNU
libltdl (which will disable dynamic shared object loading) by
configuring with --disable-dlopen.

 end of output of /opal/libltdl/config.log 

## --- ##
## confdefs.h. ##
## --- ##

#define PACKAGE_BUGREPORT "bug-libt...@gnu.org"
#define PACKAGE_NAME "libltdl"
#define PACKAGE_STRING "libltdl 2.1a"
#define PACKAGE_TARNAME "libltdl"
#define PACKAGE_VERSION "2.1a"

configure: exit 1


-Original Message-
Now, if you use the latest trunk, you can use the new MX BTL, which
provides support for shared memory and self communications. Add "--mca
pml ob1 --mca btl mx --mca btl_mx_shared_mem 1 --mca btl_mx_self 1"
in order to activate these new features. If you have 10G cards, I
suggest you add "--mca btl_mx_flags 2" as well.

   Thanks,
 george.

PS: Is there any way you can attach to the processes with gdb? I would
like to see the backtrace as shown by gdb in order to be able to figure
out what's wrong there.





Re: [OMPI users] Ompi failing on mx only

2007-01-05 Thread Grobe, Gary L. (JSC-EV)[ESCG]
This is just an FYI on the Jan 5th snapshot.

I'll send a backtrace of the processes as soon as I get a b3 running.
Between my filtered webdav svn access problems and the latest nightly
snapshots, my builds are currently failing where the same configure
line worked on previous snapshots ...

$ ./configure --prefix=/usr/local/openmpi-1.2b3r13006 --with-mx=/opt/mx
--with-mx-libdir=/opt/mx/lib
...

*** GNU libltdl setup
configure: OMPI configuring in opal/libltdl
configure: running /bin/sh './configure'
'--prefix=/usr/local/openmpi-1.2b3r13006' '--with-mx=/opt/mx'
'--with-mx-libdir=/opt/mx/lib' 'F77=ifort' --enable-ltdl-convenience
--disable-ltdl-install --enable-shared --disable-static
--cache-file=/dev/null --srcdir=.
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... configure: error: newly
created file is older than distributed files!
Check your system clock
configure: /bin/sh './configure' *failed* for opal/libltdl
configure: error: Failed to build GNU libltdl.  This usually means that
something
is incorrectly setup with your environment.  There may be useful
information in
opal/libltdl/config.log.  You can also disable GNU libltdl (which will
disable
dynamic shared object loading) by configuring with --disable-dlopen.

 end of output of /opal/libltdl/config.log 

## --- ##
## confdefs.h. ##
## --- ##

#define PACKAGE_BUGREPORT "bug-libt...@gnu.org"
#define PACKAGE_NAME "libltdl"
#define PACKAGE_STRING "libltdl 2.1a"
#define PACKAGE_TARNAME "libltdl"
#define PACKAGE_VERSION "2.1a"

configure: exit 1


-Original Message-
Now, if you use the latest trunk, you can use the new MX BTL, which
provides support for shared memory and self communications. Add "--mca
pml ob1 --mca btl mx --mca btl_mx_shared_mem 1 --mca btl_mx_self 1"
in order to activate these new features. If you have 10G cards, I
suggest you add "--mca btl_mx_flags 2" as well.

   Thanks,
 george.

PS: Is there any way you can attach to the processes with gdb? I would
like to see the backtrace as shown by gdb in order to be able to figure
out what's wrong there.




Re: [OMPI users] Ompi failing on mx only

2007-01-04 Thread Grobe, Gary L. (JSC-EV)[ESCG]
I've grabbed last night's tarball (1.2b3r12981) and tried using the
shared memory transport on the btl, and mx,self on the mtl, with the
same results. What I don't get is that sometimes it works and sometimes
it doesn't (for either). For example, I can run it 10 times
successfully, then increase the -np from 7 to 10 across 3 nodes, and
it'll immediately fail.

Here's an example of one run right after another.

$ mpirun --prefix /usr/local/openmpi-1.2b3r12981/ -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 -np 10 --mca mtl
mx,self ./cpi 
Process 0 of 10 is on node-25
Process 4 of 10 is on node-26
Process 1 of 10 is on node-25
Process 5 of 10 is on node-26
Process 2 of 10 is on node-25
Process 8 of 10 is on node-27
Process 6 of 10 is on node-26
Process 9 of 10 is on node-27
Process 7 of 10 is on node-26
Process 3 of 10 is on node-25
pi is approximately 3.1415926544231256, Error is 0.0825
wall clock time = 0.017513

$ mpirun --prefix /usr/local/openmpi-1.2b3r12981/ -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 -np 10 --mca mtl
mx,self ./cpi 
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:(nil)
[0] func:/usr/local/openmpi-1.2b3r12981/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b8ddf3ccd3f]
[1] func:/usr/local/openmpi-1.2b3r12981/lib/libopen-pal.so.0 [0x2b8ddf3cb891]
[2] func:/lib/libpthread.so.0 [0x2b8ddf98f6c0]
[3] func:/opt/mx/lib/libmyriexpress.so(mx_open_endpoint+0x6df) [0x2b8de25bf2af]
[4] func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_btl_mx.so(mca_btl_mx_component_init+0x5d7) [0x2b8de27dcd27]
[5] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_btl_base_select+0x156) [0x2b8ddf125b46]
[6] func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x11) [0x2b8de26d7491]
[7] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_bml_base_init+0x7d) [0x2b8ddf12543d]
[8] func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_component_init+0x6b) [0x2b8de23a4f8b]
[9] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_pml_base_select+0x113) [0x2b8ddf12cea3]
[10] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(ompi_mpi_init+0x45a) [0x2b8ddf0f5bda]
[11] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(MPI_Init+0x83) [0x2b8ddf116af3]
[12] func:./cpi(main+0x42) [0x400cd5]
[13] func:/lib/libc.so.6(__libc_start_main+0xe3) [0x2b8ddfab50e3]
[14] func:./cpi [0x400bd9]
*** End of error message ***
mpirun noticed that job rank 0 with PID 0 on node node-25 exited on
signal 11.
9 additional processes aborted (not shown) 

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Brian W. Barrett
Sent: Tuesday, January 02, 2007 4:11 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

Sorry to jump into the discussion late.  The mx btl does not support
communication between processes on the same node by itself, so you have
to include the shared memory transport when using MX.  This will
eventually be fixed, but likely not for the 1.2 release.  So if you do:

   mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH
   --hostfile ./h1-3 -np 2 --mca btl mx,sm,self ./cpi

It should work much better.  As for the MTL, there is a bug in the MX
MTL for v1.2 that has been fixed, but only after 1.2b2, and it could
cause the random failures you were seeing.  It will work much better
after 1.2b3 is released (or if you are feeling really lucky, you can
try out the 1.2 nightly tarballs).

The MTL is a new feature in v1.2. It is a different communication
abstraction designed to support interconnects that have matching
implemented in the lower level library or in hardware (Myrinet/MX,
Portals, InfiniPath are currently implemented).  The MTL allows us to
exploit the low latency and asynchronous progress these libraries can
provide, but does mean multi-nic abilities are reduced.  Further, the
MTL is not well suited to interconnects like TCP or InfiniBand, so we
will continue supporting the BTL interface as well.

Brian


On Jan 2, 2007, at 2:44 PM, Grobe, Gary L. ((JSC-EV))[ESCG] wrote:

> About the -x, I've been trying it both ways and prefer the latter, and
> results for either are the same. But its value is correct.
> I've attached the ompi_info from node-1 and node-2. Sorry for not
> zipping them, but they were small and I think I'd have firewall
> issues.
>
> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH
> --hostfile ./h13-15 -np 6 --mca pml cm ./cpi
> [node-14:19260] mx_connect fail for node-14:0 with key  (error
> Endpoint closed or not connectable!)
> [node-14:19261] mx_connect fail for node-14:0 with key  (error
> Endpoint closed or not connectable!) ...
>
> Is there any info anywhere on the MTL? Anyway, I've run with the mtl,
> and sometimes it actually worked once. But now I can't reproduce it,
> and it's throwing sig 7's, 11's, and 4's depending upon the number of
> procs I give it.

Re: [OMPI users] Ompi failing on mx only

2007-01-03 Thread Grobe, Gary L. (JSC-EV)[ESCG]
Just as an FYI, I also included the sm param as you suggested and
changed the -np to 1, because anything more than that just duplicates
the same error. I also saw this same error message in previous posts as
a bug. Would that be the same issue in this case?

$ mpirun --prefix /usr/local/openmpi-1.2b2 --hostfile ./h1-3 -np 1 --mca
btl mx,sm,self ./cpi
[node-1:09704] mca: base: component_find: unable to open mtl mx: file
not found (ignored)
[node-1:09704] mca: base: component_find: unable to open btl mx: file
not found (ignored)
Process 0 of 1 is on node-1
pi is approximately 3.1415926544231341, Error is 0.08333410
wall clock time = 0.000331


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Brian W. Barrett
Sent: Tuesday, January 02, 2007 4:11 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

Sorry to jump into the discussion late.  The mx btl does not support
communication between processes on the same node by itself, so you have
to include the shared memory transport when using MX.  This will
eventually be fixed, but likely not for the 1.2 release.  So if you do:

   mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH
   --hostfile ./h1-3 -np 2 --mca btl mx,sm,self ./cpi

It should work much better.  As for the MTL, there is a bug in the MX
MTL for v1.2 that has been fixed, but only after 1.2b2, and it could
cause the random failures you were seeing.  It will work much better
after 1.2b3 is released (or if you are feeling really lucky, you can
try out the 1.2 nightly tarballs).

The MTL is a new feature in v1.2. It is a different communication
abstraction designed to support interconnects that have matching
implemented in the lower level library or in hardware (Myrinet/MX,
Portals, InfiniPath are currently implemented).  The MTL allows us to
exploit the low latency and asynchronous progress these libraries can
provide, but does mean multi-nic abilities are reduced.  Further, the
MTL is not well suited to interconnects like TCP or InfiniBand, so we
will continue supporting the BTL interface as well.

Brian


On Jan 2, 2007, at 2:44 PM, Grobe, Gary L. ((JSC-EV))[ESCG] wrote:

> About the -x, I've been trying it both ways and prefer the latter, and
> results for either are the same. But its value is correct.
> I've attached the ompi_info from node-1 and node-2. Sorry for not
> zipping them, but they were small and I think I'd have firewall
> issues.
>
> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH
> --hostfile ./h13-15 -np 6 --mca pml cm ./cpi
> [node-14:19260] mx_connect fail for node-14:0 with key  (error
> Endpoint closed or not connectable!)
> [node-14:19261] mx_connect fail for node-14:0 with key  (error
> Endpoint closed or not connectable!) ...
>
> Is there any info anywhere on the MTL? Anyway, I've run with the mtl,
> and sometimes it actually worked once. But now I can't reproduce it,
> and it's throwing sig 7's, 11's, and 4's depending upon the number of
> procs I give it. But now that you mention mapper, I take it that's
> what SEGV_MAPERR might be referring to. I'm looking into the
>
> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x
> LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl
> mx,self ./cpi
> Process 4 of 5 is on node-2
> Process 0 of 5 is on node-1
> Process 1 of 5 is on node-1
> Process 2 of 5 is on node-1
> Process 3 of 5 is on node-1
> pi is approximately 3.1415926544231225, Error is 0.08333294
> wall clock time = 0.019305
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x2b88243862be
> mpirun noticed that job rank 0 with PID 0 on node node-1 exited on
> signal 1.
> 4 additional processes aborted (not shown)
>
> Or sometimes I'll get this error, just depending upon the number of
> procs ...
>
> mpirun --prefix /usr/local/openmpi-1.2b2 -x
> LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 7 --mca mtl
> mx,self ./cpi
> Signal:7 info.si_errno:0(Success) si_code:2()
> Failing at addr:0x2aaab000
> [0] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b9b7fa52d1f]
> [1] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0 [0x2b9b7fa51871]
> [2] func:/lib/libpthread.so.0 [0x2b9b80013d00]
> [3] func:/usr/local/openmpi-1.2b2/lib/libmca_common_sm.so.0(mca_common_sm_mmap_init+0x1e3) [0x2b9b8270ef83]
> [4] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_mpool_sm.so [0x2b9b8260d0ff]
> [5] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(mca_mpool_base_module_create+0x70) [0x2b9b7f7afac0]
> [6] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs_same_base_addr+0x907) [0x2b9b83070517]
> [7] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x206) [0x2b9b82d5f576] [

Re: [OMPI users] Ompi failing on mx only

2007-01-02 Thread Grobe, Gary L. (JSC-EV)[ESCG]
I'm losing it today; I just now noticed I sent mx_info for the wrong
nodes ...


// node-1
$ mx_info
MX Version: 1.1.6
MX Build: ggrobe@juggernaut:/home/ggrobe/Tools/mx-1.1.6 Thu Nov 30
14:17:44 GMT 2006
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===
Instance #0:  224.9 MHz LANai, 133.3 MHz PCI bus, 2 MB SRAM
Status: Running, P0: Link up
MAC Address:00:60:dd:47:ab:c9
Product code:   M3F-PCIXD-2 V2.2
Part number:09-03034
Serial number:  299207
Mapper: 46:4d:53:4d:41:50, version = 0x000c, configured
Mapped hosts:   16

                                           ROUTE COUNT
INDEX    MAC ADDRESS       HOST NAME       P0
-----    -----------       ---------       ---
   0) 00:60:dd:47:ab:c9 node-1:0  1,1
   1) 00:60:dd:47:c2:a7 juggernaut:0  5,3
   2) 00:60:dd:47:ab:c8 node-2:0  6,3
   3) 00:60:dd:47:ab:ca node-3:0  7,3
   4) 00:60:dd:47:bf:65 node-7:0  7,3
   5) 00:60:dd:47:c2:e1 node-8:0  7,3
   6) 00:60:dd:47:c0:c1 node-9:0  6,3
   7) 00:60:dd:47:c0:e5 node-13:0 1,1
   8) 00:60:dd:47:c2:91 node-14:0 7,3
   9) 00:60:dd:47:c0:b2 node-15:0 7,3
  10) 00:60:dd:47:bf:f5 node-19:0 6,3
  11) 00:60:dd:47:c0:b1 node-20:0 8,3
  12) 00:60:dd:47:c0:f8 node-21:0 5,3
  13) 00:60:dd:47:c0:8a node-25:0 6,3
  14) 00:60:dd:47:c0:c2 node-27:0 7,3
  15) 00:60:dd:47:c2:e0 node-26:0 6,3

// node-2
$ mx_info
MX Version: 1.1.6
MX Build: ggrobe@juggernaut:/home/ggrobe/Tools/mx-1.1.6 Thu Nov 30
14:17:44 GMT 2006
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===
Instance #0:  224.9 MHz LANai, 133.0 MHz PCI bus, 2 MB SRAM
Status: Running, P0: Link up
MAC Address:00:60:dd:47:ab:c8
Product code:   M3F-PCIXD-2 V2.2
Part number:09-03034
Serial number:  299208
Mapper: 46:4d:53:4d:41:50, version = 0x000c, configured
Mapped hosts:   16

                                           ROUTE COUNT
INDEX    MAC ADDRESS       HOST NAME       P0
-----    -----------       ---------       ---
   0) 00:60:dd:47:ab:c8 node-2:0  1,1
   1) 00:60:dd:47:ab:c9 node-1:0  6,3
   2) 00:60:dd:47:c2:a7 juggernaut:0  5,3
   3) 00:60:dd:47:ab:ca node-3:0  6,3
   4) 00:60:dd:47:bf:65 node-7:0  5,3
   5) 00:60:dd:47:c2:e1 node-8:0  6,3
   6) 00:60:dd:47:c0:c1 node-9:0  5,3
   7) 00:60:dd:47:c0:e5 node-13:0 6,3
   8) 00:60:dd:47:c2:91 node-14:0 8,3
   9) 00:60:dd:47:c0:b2 node-15:0 1,1
  10) 00:60:dd:47:bf:f5 node-19:0 5,3
  11) 00:60:dd:47:c0:f8 node-21:0 5,3
  12) 00:60:dd:47:c0:8a node-25:0 5,3
  13) 00:60:dd:47:c0:c2 node-27:0 6,3
  14) 00:60:dd:47:c2:e0 node-26:0 5,3
  15) 00:60:dd:47:c0:b1 node-20:0 6,3

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Reese Faucette
Sent: Tuesday, January 02, 2007 4:08 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

> I've attached the ompi_info from node-1 and node-2.

thanks, but i need "mx_info", not "ompi_info" ;-)

> But now that you mention mapper, I take it that's what SEGV_MAPERR 
> might be referring to.

this is an ompi red herring; it has nothing to do with Myrinet mapping,
even though it kinda looks like it.

> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x 
> LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl 
> mx,self ./cpi
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)

A gdb traceback would be interesting on this one.
thanks,
-reese 



Re: [OMPI users] Ompi failing on mx only

2007-01-02 Thread Grobe, Gary L. (JSC-EV)[ESCG]
Ah, sorry about that ... 

$ ./mx_info
MX Version: 1.1.6
MX Build: ggrobe@juggernaut:/home/ggrobe/Tools/mx-1.1.6 Thu Nov 30
14:17:44 GMT 2006
1 Myrinet board installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
===
Instance #0:  224.9 MHz LANai, 99.7 MHz PCI bus, 2 MB SRAM
Status: Running, P0: Link up
MAC Address:00:60:dd:47:c2:a7
Product code:   M3F-PCIXD-2 V2.2
Part number:09-03034
Serial number:  291824
Mapper: 46:4d:53:4d:41:50, version = 0x000c, configured
Mapped hosts:   16

                                           ROUTE COUNT
INDEX    MAC ADDRESS       HOST NAME       P0
-----    -----------       ---------       ---
   0) 00:60:dd:47:c2:a7 juggernaut:0  1,1
   1) 00:60:dd:47:ab:c9 node-1:0  6,3
   2) 00:60:dd:47:ab:c8 node-2:0  6,3
   3) 00:60:dd:47:ab:ca node-3:0  6,3
   4) 00:60:dd:47:bf:65 node-7:0  6,3
   5) 00:60:dd:47:c2:e1 node-8:0  6,3
   6) 00:60:dd:47:c0:c1 node-9:0  6,3
   7) 00:60:dd:47:c0:e5 node-13:0 6,3
   8) 00:60:dd:47:c2:91 node-14:0 6,3
   9) 00:60:dd:47:c0:b2 node-15:0 6,3
  10) 00:60:dd:47:bf:f5 node-19:0 1,1
  11) 00:60:dd:47:c0:b1 node-20:0 6,3
  12) 00:60:dd:47:c0:f8 node-21:0 7,3
  13) 00:60:dd:47:c0:8a node-25:0 6,3
  14) 00:60:dd:47:c0:c2 node-27:0 5,3
  15) 00:60:dd:47:c2:e0 node-26:0 5,3

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Reese Faucette
Sent: Tuesday, January 02, 2007 4:08 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

> I've attached the ompi_info from node-1 and node-2.

thanks, but i need "mx_info", not "ompi_info" ;-)

> But now that you mention mapper, I take it that's what SEGV_MAPERR 
> might be referring to.

this is an ompi red herring; it has nothing to do with Myrinet mapping,
even though it kinda looks like it.

> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x 
> LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl 
> mx,self ./cpi
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)

A gdb traceback would be interesting on this one.
thanks,
-reese 





[OMPI users] Ompi failing on mx only

2007-01-02 Thread Grobe, Gary L. (JSC-EV)[ESCG]

I was initially using 1.1.2 and moved to 1.2b2 because of a hang in
MPI_Bcast() which 1.2b2 reportedly fixes, and it seems to have done so.
My compute nodes each have 2 dual-core Xeons, on Myrinet with MX. The
problem is trying to get ompi running on mx only. My machine file is as
follows ...

node-1 slots=4 max-slots=4
node-2 slots=4 max-slots=4
node-3 slots=4 max-slots=4

Running mpirun with the minimum number of processes needed to get the
error ...
mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH
--hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi

Results with the following output ...

:~/Projects/ompi/cpi$ mpirun --prefix /usr/local/openmpi-1.2b2 -x
LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi


--
Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of 
usable components.

--

--
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of 
usable components.

--

--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)

--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)

--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)

--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 1 with PID 0 on node node-1 exited on
signal 1.

 end of output ---

I get that same error with the examples included in the ompi-1.2b2
distribution. However, if I change the mca params as follows ...

mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH
--hostfile ./h1-3 -np 5 --mca pml cm ./cpi

Running up to -np 5 works (one of the processes does get put on the 2nd
node), but running with -np 6 fails with the following ...

[node-2:10464] mx_connect fail for node-2:0 with key  (error
Endpoint closed or not connectable!)
[node-2:10463] mx_connect fail for node-2:0 with key  (error
Endpoint closed or not connectable!)

--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)

--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)

--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)

--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on