Re: [OMPI users] Cannot suppress openib error message

2007-10-25 Thread Dirk Eddelbuettel

On 25 October 2007 at 07:54, Jeff Squyres wrote:
| We will not dlopen libibverbs.so directly -- we will only dlopen the  
| mca_btl_openib.so file.  The dynamic linker will automatically open  
| all of its dependencies.  If those dependencies cannot be found /  
| symbols cannot be resolved, the dynamic linker will fail the dlopen  
| of libibverbs.
| 
| Can you run "ldd mca_btl_openib.so" on your head node and your  
| compute nodes?  See if there's a difference in the output.  I think  
| this is the next step in this troubleshooting process...

Sure, good idea.

head and build machine:

$ ldd /usr/lib/openmpi/mca_btl_openib.so
linux-gate.so.1 =>  (0xe000)
libibverbs.so.1 => /usr/lib/libibverbs.so.1 (0xb7f42000)
libpthread.so.0 => /lib/libpthread.so.0 (0xb7f2b000)
libmpi.so.0 => /usr/lib/libmpi.so.0 (0xb7ea6000)
libopen-rte.so.0 => /usr/lib/libopen-rte.so.0 (0xb7e52000)
libopen-pal.so.0 => /usr/lib/libopen-pal.so.0 (0xb7dfb000)
libdl.so.2 => /lib/libdl.so.2 (0xb7df7000)
libnsl.so.1 => /lib/libnsl.so.1 (0xb7de1000)
libutil.so.1 => /lib/libutil.so.1 (0xb7ddd000)
libm.so.6 => /lib/libm.so.6 (0xb7db7000)
libc.so.6 => /lib/libc.so.6 (0xb7c8a000)
/lib/ld-linux.so.2 (0x8000)

compute node:
$ ldd /usr/lib/openmpi/mca_btl_openib.so
/usr/lib/openmpi/mca_btl_openib.so: /usr/lib/libibverbs.so.1: version `IBVERBS_1.1' not found (required by /usr/lib/openmpi/mca_btl_openib.so)
linux-gate.so.1 =>  (0xe000)
libibverbs.so.1 => /usr/lib/libibverbs.so.1 (0xb7ee6000)
libpthread.so.0 => /lib/tls/i686/cmov/libpthread.so.0 (0xb7ecf000)
libmpi.so.0 => /usr/lib/libmpi.so.0 (0xb7e4a000)
libopen-rte.so.0 => /usr/lib/libopen-rte.so.0 (0xb7df6000)
libopen-pal.so.0 => /usr/lib/libopen-pal.so.0 (0xb7d9f000)
libdl.so.2 => /lib/tls/i686/cmov/libdl.so.2 (0xb7d9b000)
libnsl.so.1 => /lib/tls/i686/cmov/libnsl.so.1 (0xb7d84000)
libutil.so.1 => /lib/tls/i686/cmov/libutil.so.1 (0xb7d8)
libm.so.6 => /lib/tls/i686/cmov/libm.so.6 (0xb7d58000)
libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7c17000)
libsysfs.so.2 => /lib/libsysfs.so.2 (0xb7c0c000)
/lib/ld-linux.so.2 (0x8000)

Bingo!!  And I stand caught with an inconsistent package install. Tsk,
tsk.
I *think* this goes back to a point before "we" (as in the few folks
looking after the .deb for Open MPI) had learned about the 'btl = ^openib'
option: I had become so disenchanted with the 'noisy' message that I
hacked libibverbs.  That may explain the head node.  Let me get that one
back to the pristine Ubuntu / Debian package, and then possibly rebuild
the Open MPI package there to get the Depends right.
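
For anyone repeating the head-node / compute-node comparison, here is a minimal sketch of how the two ldd listings could be diffed. The helper name normalize_ldd and the host name node01 are made up for illustration; the only assumption is a sed that understands -E.

```shell
# Strip leading whitespace and the load addresses from ldd output read on
# stdin, then sort, so that listings taken on different machines can be
# compared line by line.
normalize_ldd() {
    sed -E -e 's/^[[:space:]]+//' -e 's/[[:space:]]*\(0x[0-9a-fA-F]+\)$//' | sort
}

# Hypothetical usage ("node01" is a placeholder host name):
#   ldd /usr/lib/openmpi/mca_btl_openib.so | normalize_ldd > head.txt
#   ssh node01 ldd /usr/lib/openmpi/mca_btl_openib.so | normalize_ldd > node.txt
#   diff head.txt node.txt
```

If the paths match but a version such as IBVERBS_1.1 is reported missing, the versions that a library actually defines can probably be listed with "objdump -p /usr/lib/libibverbs.so.1" (look for the "Version definitions" section), assuming binutils is installed.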

Thanks so much for your help and patience on this.

Dirk

-- 
Three out of two people have difficulties with fractions.


Re: [OMPI users] Cannot suppress openib error message

2007-10-25 Thread Jeff Squyres

On Oct 24, 2007, at 10:05 PM, Dirk Eddelbuettel wrote:

| > | If I had to guess, the systems where you don't see the warning are
| > | systems that have OFED loaded.
| >
| > I am pretty sure that none of the systems (at work) have IB hardware.  I am
| > very sure that my home systems do not, and there the 'btl = ^openib'
| > successfully suppresses the warning --- whereas at work it doesn't.
|
| Note that you don't need to have IB hardware -- all you need is the
| OFED software loaded.  I don't know if Debian ships the OFED
| libraries by default...?  In particular, look for libibverbs:
|
| [18:28] svbu-mpi:~/svn/ompi % ldd $bogus/lib/openmpi/mca_btl_openib.so
|  libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x002a956c2000)
|  libnsl.so.1 => /lib64/libnsl.so.1 (0x002a957cd000)
|  libutil.so.1 => /lib64/libutil.so.1 (0x002a958e4000)
|  libm.so.6 => /lib64/tls/libm.so.6 (0x002a959e8000)
|  libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x002a95b6e000)
|  libc.so.6 => /lib64/tls/libc.so.6 (0x002a95c83000)
|  libdl.so.2 => /lib64/libdl.so.2 (0x002a95eb8000)
|  /lib64/ld-linux-x86-64.so.2 (0x00552000)

Good point.  However, I use the .deb packages which I build for Debian,
and they use libibverbs where available:

Build-Depends: [...], libibverbs-dev [!kfreebsd-i386 !kfreebsd-amd64 \
!hurd-i386], gfortran, libsysfs-dev, automake, gcc (>= 4:4.1.2)

in particular on i386. Consequently, the binary package ends up with a
Depends on the run-time package 'libibverbs1' -- and this will hence always
be present as all my systems use the .deb packages (either from Debian or
locally rebuilt) that force libibverbs1 in via this Depends.

At work, I rebuild these same packages under Ubuntu on my "head node".  And
on the head node, no warning is seen -- whereas my compute nodes issue the
warning.

Could this be another one of the dlopen issues where basically
dlopen("libibverbs.so")
is executed?   Because the compute nodes do NOT have libibverbs.so (from the
-dev package) but only libibverbs.so.1.0.0 and its matching symlink
libibverbs.so.1.


We will not dlopen libibverbs.so directly -- we will only dlopen the  
mca_btl_openib.so file.  The dynamic linker will automatically open  
all of its dependencies.  If those dependencies cannot be found /  
symbols cannot be resolved, the dynamic linker will fail the dlopen  
of libibverbs.


Can you run "ldd mca_btl_openib.so" on your head node and your  
compute nodes?  See if there's a difference in the output.  I think  
this is the next step in this troubleshooting process...
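
As a rough illustration of what to look for in that ldd output: the dynamic linker's complaints, whether about a missing library or a missing symbol version, contain the words "not found". The sketch below filters for them. The function name ldd_problems is invented here, and the stderr redirect in the usage line is there because ldd may print version errors on stderr.

```shell
# Read ldd output on stdin and print only the lines that report something
# unresolved (a missing library or a missing symbol version).
# Exit status is 0 if at least one such line was seen, 1 otherwise.
ldd_problems() {
    grep 'not found'
}

# Hypothetical usage:
#   ldd /usr/lib/openmpi/mca_btl_openib.so 2>&1 | ldd_problems
```

An empty result on one node and a hit on another would point at the node whose libraries do not match what the plugin was built against.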


I just tested that hypothesis and installed libibverbs-dev, but no beans. Still
get the warning.

| However, I note something in your last reply that I may have missed
| before -- can you clarify a point for me: are you saying that on your
| home machine, this generates the openib "file not found" warning:
|
|  mpirun -np 2 hello
|
| but this does not:
|
|  mpirun -np 2 --mca btl ^openib hello

More or less, but I use /etc/openmpi/openmpi-mca-params.conf to toggle
^openib.  Adding it again as --mca btl ^openib changes nothing, unfortunately.


This MCA behavior is as expected; adding a param to
openmpi-mca-params.conf is exactly the same as putting it on the
command line (except that the command line has higher precedence).
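
To make the precedence point concrete, here is a small sketch of the usual places the same "btl" setting can come from. The environment-variable form (OMPI_MCA_ followed by the parameter name) is standard Open MPI, and my understanding is that it ranks between the file and the command line, but treat that ordering as an assumption and check it against your version. The mpirun line is left as a comment so the snippet is harmless to paste.

```shell
# 1. Per-system file (lowest precedence of the three), a line such as
#        btl = ^openib
#    in /etc/openmpi/openmpi-mca-params.conf
#
# 2. Environment variable, read by mpirun / orterun at startup:
export OMPI_MCA_btl='^openib'
#
# 3. Command line (highest precedence). The caret is quoted here because
#    some shells treat an unquoted ^ specially:
#        mpirun -np 2 --mca btl '^openib' hello
```

Something like "ompi_info --param btl all" should show the parameters the installation knows about.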


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Cannot suppress openib error message

2007-10-24 Thread Dirk Eddelbuettel

On 24 October 2007 at 21:31, Jeff Squyres wrote:
| On Oct 24, 2007, at 9:23 PM, Dirk Eddelbuettel wrote:
| 
| > | If I had to guess, the systems where you don't see the warning are
| > | systems that have OFED loaded.
| >
| > I am pretty sure that none of the systems (at work) have IB  
| > hardware.  I am
| > very sure that my home systems do not, and there the 'btl = ^openib'
| > successfully suppresses the warning --- whereas at work it doesn't.
| 
| Note that you don't need to have IB hardware -- all you need is the  
| OFED software loaded.  I don't know if Debian ships the OFED  
| libraries by default...?  In particular, look for libibverbs:
| 
| [18:28] svbu-mpi:~/svn/ompi % ldd $bogus/lib/openmpi/mca_btl_openib.so
|  libibverbs.so.1 => /usr/lib64/libibverbs.so.1  
| (0x002a956c2000)
|  libnsl.so.1 => /lib64/libnsl.so.1 (0x002a957cd000)
|  libutil.so.1 => /lib64/libutil.so.1 (0x002a958e4000)
|  libm.so.6 => /lib64/tls/libm.so.6 (0x002a959e8000)
|  libpthread.so.0 => /lib64/tls/libpthread.so.0  
| (0x002a95b6e000)
|  libc.so.6 => /lib64/tls/libc.so.6 (0x002a95c83000)
|  libdl.so.2 => /lib64/libdl.so.2 (0x002a95eb8000)
|  /lib64/ld-linux-x86-64.so.2 (0x00552000)

Good point.  However, I use the .deb packages which I build for Debian,
and they use libibverbs where available:

Build-Depends: [...], libibverbs-dev [!kfreebsd-i386 !kfreebsd-amd64 \
!hurd-i386], gfortran, libsysfs-dev, automake, gcc (>= 4:4.1.2)

in particular on i386. Consequently, the binary package ends up with a
Depends on the run-time package 'libibverbs1' -- and this will hence always
be present as all my systems use the .deb packages (either from Debian or
locally rebuilt) that force libibverbs1 in via this Depends.

At work, I rebuild these same packages under Ubuntu on my "head node".  And
on the head node, no warning is seen -- whereas my compute nodes issue the
warning.

Could this be another one of the dlopen issues where basically
dlopen("libibverbs.so")
is executed?   Because the compute nodes do NOT have libibverbs.so (from the
-dev package) but only libibverbs.so.1.0.0 and its matching symlink
libibverbs.so.1.

I just tested that hypothesis and installed libibverbs-dev, but no beans. Still
get the warning. 

| However, I note something in your last reply that I may have missed  
| before -- can you clarify a point for me: are you saying that on your  
| home machine, this generates the openib "file not found" warning:
| 
|  mpirun -np 2 hello
| 
| but this does not:
| 
|  mpirun -np 2 --mca btl ^openib hello

More or less, but I use /etc/openmpi/openmpi-mca-params.conf to toggle
^openib.  Adding it again as --mca btl ^openib changes nothing, unfortunately.

| If so, can you confirm which version of Open MPI you are running?   
| The only reason that I can think that that would happen is if you are  
| running a trunk nightly download of Open MPI...  If not, then there's  
| something else going on that would be worth understanding.

No, plain 1.2.4 from the original tarballs.

Still puzzled.  To recap, the head node and the compute nodes all use the same
Ubuntu release and the same binary .deb packages of Open MPI 1.2.4 that I
rebuilt there.  The 'sole' difference is that the 'head node' has more
development packages and tools installed -- but that should not matter.  I
just re-checked and the compute node does not have any LAM or MPICH
parts remaining.

Dirk

-- 
Three out of two people have difficulties with fractions.


Re: [OMPI users] Cannot suppress openib error message

2007-10-24 Thread Jeff Squyres

On Oct 24, 2007, at 9:23 PM, Dirk Eddelbuettel wrote:


| If I had to guess, the systems where you don't see the warning are
| systems that have OFED loaded.

I am pretty sure that none of the systems (at work) have IB hardware.  I am
very sure that my home systems do not, and there the 'btl = ^openib'
successfully suppresses the warning --- whereas at work it doesn't.


Note that you don't need to have IB hardware -- all you need is the  
OFED software loaded.  I don't know if Debian ships the OFED  
libraries by default...?  In particular, look for libibverbs:


[18:28] svbu-mpi:~/svn/ompi % ldd $bogus/lib/openmpi/mca_btl_openib.so
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x002a956c2000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x002a957cd000)
libutil.so.1 => /lib64/libutil.so.1 (0x002a958e4000)
libm.so.6 => /lib64/tls/libm.so.6 (0x002a959e8000)
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x002a95b6e000)
libc.so.6 => /lib64/tls/libc.so.6 (0x002a95c83000)
libdl.so.2 => /lib64/libdl.so.2 (0x002a95eb8000)
/lib64/ld-linux-x86-64.so.2 (0x00552000)

However, I note something in your last reply that I may have missed  
before -- can you clarify a point for me: are you saying that on your  
home machine, this generates the openib "file not found" warning:


mpirun -np 2 hello

but this does not:

mpirun -np 2 --mca btl ^openib hello

If so, can you confirm which version of Open MPI you are running?   
The only reason that I can think that that would happen is if you are  
running a trunk nightly download of Open MPI...  If not, then there's  
something else going on that would be worth understanding.


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Cannot suppress openib error message

2007-10-24 Thread Dirk Eddelbuettel

On 24 October 2007 at 16:22, Jeff Squyres wrote:
| On Oct 24, 2007, at 4:16 PM, Dirk Eddelbuettel wrote:
| 
| > I buy that explanation any day, but what is funny is that the
| > btl = ^openib
| > does suppress the warning on some of my systems (all running 1.2.4)  
| > but not
| > others (also running 1.2.4).
| 
| If I had to guess, the systems where you don't see the warning are  
| systems that have OFED loaded.

I am pretty sure that none of the systems (at work) have IB hardware.  I am
very sure that my home systems do not, and there the 'btl = ^openib'
successfully suppresses the warning --- whereas at work it doesn't.

Must be a side effect of something else. I made sure no LAM libs were
left around.

Dirk


-- 
Three out of two people have difficulties with fractions.


Re: [OMPI users] Cannot suppress openib error message

2007-10-24 Thread Jeff Squyres

On Oct 24, 2007, at 4:16 PM, Dirk Eddelbuettel wrote:


I buy that explanation any day, but what is funny is that the
btl = ^openib
does suppress the warning on some of my systems (all running 1.2.4) but not
others (also running 1.2.4).


If I had to guess, the systems where you don't see the warning are  
systems that have OFED loaded.


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Cannot suppress openib error message

2007-10-24 Thread Dirk Eddelbuettel

Hi Jeff,

On 24 October 2007 at 15:43, Jeff Squyres wrote:
| This is quite likely because of a "feature" in how the OMPI v1.2  
| series handles its plugins.  In OMPI <=v1.2.x, Open MPI opens all  
| plugins that it can find and *then* applies the filter that you  
| provide (e.g., via the "btl" MCA param) to close / ignore certain  
| plugins.
| 
| In OMPI >=v1.3, we [effectively] apply the filter *before* opening  
| plugins.  So "--mca btl ^openib" will actually prevent the openib BTL  
| plugin from being loaded.
| 
| I'm guessing that what you're seeing today is because we're opening  
| the openib BTL on a system where the OpenFabrics support libraries  
| are not available, and therefore the dlopen() fails.  The error  
| string that we get back from libltdl is the somewhat-misleading "file  
| not found (ignored)", and that's what we print (note that ltdl is  
| referring to the fact that a dependent library is not found).

I buy that explanation any day, but what is funny is that the 
btl = ^openib
does suppress the warning on some of my systems (all running 1.2.4) but not
others (also running 1.2.4).

Hm.

Dirk

| On Oct 24, 2007, at 9:51 AM, Dirk Eddelbuettel wrote:
| 
| >
| > I've been scratching my head over this:
| >
| > lnx01:/usr/lib> orterun -n 2  --mca btl ^openib  ~/c++/tests/mpitest
| > [lnx01:14417] mca: base: component_find: unable to open btl openib:  
| > file not found (ignored)
| > [lnx01:14418] mca: base: component_find: unable to open btl openib:  
| > file not found (ignored)
| > Hello world, I'm process 0
| > Hello world, I'm process 1
| > lnx01:/usr/lib> grep openib /etc/openmpi/openmpi-mca-params.conf
| > #   btl = ^openib
| > btl = ^openib
| > lnx01:/usr/lib> orterun -n 2   ~/c++/tests/mpitest
| > [lnx01:14429] mca: base: component_find: unable to open btl openib:  
| > file not found (ignored)
| > [lnx01:14430] mca: base: component_find: unable to open btl openib:  
| > file not found (ignored)
| > Hello world, I'm process 0
| > Hello world, I'm process 1
| >
| > and when I strace it, I get
| >
| > uname({sys="Linux", node="lnx01", ...}) = 0
| > open("/etc/openmpi/openmpi-mca-params.conf", O_RDONLY) = 3
| > ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf820698) = -1 ENOTTY  
| > (Inappropriate ioctl for device)
| > fstat64(3, {st_mode=S_IFREG|0644, st_size=2877, ...}) = 0
| > mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,  
| > -1, 0) = 0xb7f72000
| > read(3, "#\n# Copyright (c) 2004-2005 The "..., 8192) = 2877
| > read(3, "", 4096)   = 0
| > read(3, "", 8192)   = 0
| > ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf8205f8) = -1 ENOTTY  
| > (Inappropriate ioctl for device)
| > close(3)= 0
| > munmap(0xb7f72000, 4096)= 0
| >
| > Why can't I suppress the dreaded InfiniBand message?
| >
| > System is Ubuntu 7.04 with 'ported' (ie locally recompiled) current  
| > Open MPI packages
| > from Debian.
| >
| > Dirk
| >
| > -- 
| > Three out of two people have difficulties with fractions.
| > ___
| > users mailing list
| > us...@open-mpi.org
| > http://www.open-mpi.org/mailman/listinfo.cgi/users
| 
| 
| -- 
| Jeff Squyres
| Cisco Systems
| 

-- 
Three out of two people have difficulties with fractions.


Re: [OMPI users] Cannot suppress openib error message

2007-10-24 Thread Jeff Squyres
This is quite likely because of a "feature" in how the OMPI v1.2  
series handles its plugins.  In OMPI <=v1.2.x, Open MPI opens all  
plugins that it can find and *then* applies the filter that you  
provide (e.g., via the "btl" MCA param) to close / ignore certain  
plugins.


In OMPI >=v1.3, we [effectively] apply the filter *before* opening  
plugins.  So "--mca btl ^openib" will actually prevent the openib BTL  
plugin from being loaded.


I'm guessing that what you're seeing today is because we're opening  
the openib BTL on a system where the OpenFabrics support libraries  
are not available, and therefore the dlopen() fails.  The error  
string that we get back from libltdl is the somewhat-misleading "file  
not found (ignored)", and that's what we print (note that ltdl is  
referring to the fact that a dependent library is not found).
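
To illustrate the difference between the two orderings, here is a toy model in shell. It is only a sketch: none of these function names exist in Open MPI, "opening" a component is faked, and the component list is passed in by hand.

```shell
# Toy model only. Components whose support libraries are "missing" on this
# pretend system:
BROKEN="openib"

# Pretend to dlopen a component; warn if its dependencies cannot be resolved.
open_component() {
    case " $BROKEN " in
        *" $1 "*)
            echo "unable to open btl $1: file not found (ignored)"
            return 1 ;;
    esac
    return 0
}

# v1.2-style: open every component found, and only then drop the excluded one.
load_open_then_filter() {
    exclude=$1; shift
    for c in "$@"; do
        open_component "$c" || continue
        [ "$c" = "$exclude" ] || echo "using $c"
    done
    return 0
}

# v1.3-style: apply the exclusion first, so the excluded component is never opened.
load_filter_then_open() {
    exclude=$1; shift
    for c in "$@"; do
        [ "$c" = "$exclude" ] && continue
        open_component "$c" && echo "using $c"
    done
    return 0
}
```

Calling load_open_then_filter openib tcp self openib prints the warning even though openib is excluded, while load_filter_then_open with the same arguments stays quiet, which is the behavior change described above.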




On Oct 24, 2007, at 9:51 AM, Dirk Eddelbuettel wrote:



I've been scratching my head over this:

lnx01:/usr/lib> orterun -n 2  --mca btl ^openib  ~/c++/tests/mpitest
[lnx01:14417] mca: base: component_find: unable to open btl openib: file not found (ignored)
[lnx01:14418] mca: base: component_find: unable to open btl openib: file not found (ignored)
Hello world, I'm process 0
Hello world, I'm process 1
lnx01:/usr/lib> grep openib /etc/openmpi/openmpi-mca-params.conf
#   btl = ^openib
btl = ^openib
lnx01:/usr/lib> orterun -n 2   ~/c++/tests/mpitest
[lnx01:14429] mca: base: component_find: unable to open btl openib: file not found (ignored)
[lnx01:14430] mca: base: component_find: unable to open btl openib: file not found (ignored)
Hello world, I'm process 0
Hello world, I'm process 1

and when I strace it, I get

uname({sys="Linux", node="lnx01", ...}) = 0
open("/etc/openmpi/openmpi-mca-params.conf", O_RDONLY) = 3
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf820698) = -1 ENOTTY (Inappropriate ioctl for device)
fstat64(3, {st_mode=S_IFREG|0644, st_size=2877, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f72000
read(3, "#\n# Copyright (c) 2004-2005 The "..., 8192) = 2877
read(3, "", 4096)   = 0
read(3, "", 8192)   = 0
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf8205f8) = -1 ENOTTY (Inappropriate ioctl for device)
close(3) = 0
munmap(0xb7f72000, 4096) = 0

Why can't I suppress the dreaded InfiniBand message?

System is Ubuntu 7.04 with 'ported' (i.e. locally recompiled) current
Open MPI packages from Debian.

Dirk

--
Three out of two people have difficulties with fractions.



--
Jeff Squyres
Cisco Systems



[OMPI users] Cannot suppress openib error message

2007-10-24 Thread Dirk Eddelbuettel

I've been scratching my head over this:

lnx01:/usr/lib> orterun -n 2  --mca btl ^openib  ~/c++/tests/mpitest
[lnx01:14417] mca: base: component_find: unable to open btl openib: file not found (ignored)
[lnx01:14418] mca: base: component_find: unable to open btl openib: file not found (ignored)
Hello world, I'm process 0
Hello world, I'm process 1
lnx01:/usr/lib> grep openib /etc/openmpi/openmpi-mca-params.conf
#   btl = ^openib
btl = ^openib
lnx01:/usr/lib> orterun -n 2   ~/c++/tests/mpitest
[lnx01:14429] mca: base: component_find: unable to open btl openib: file not found (ignored)
[lnx01:14430] mca: base: component_find: unable to open btl openib: file not found (ignored)
Hello world, I'm process 0
Hello world, I'm process 1

and when I strace it, I get

uname({sys="Linux", node="lnx01", ...}) = 0
open("/etc/openmpi/openmpi-mca-params.conf", O_RDONLY) = 3
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf820698) = -1 ENOTTY (Inappropriate ioctl for device)
fstat64(3, {st_mode=S_IFREG|0644, st_size=2877, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f72000
read(3, "#\n# Copyright (c) 2004-2005 The "..., 8192) = 2877
read(3, "", 4096)   = 0
read(3, "", 8192)   = 0
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf8205f8) = -1 ENOTTY (Inappropriate ioctl for device)
close(3) = 0
munmap(0xb7f72000, 4096) = 0

Why can't I suppress the dreaded InfiniBand message?

System is Ubuntu 7.04 with 'ported' (i.e. locally recompiled) current
Open MPI packages from Debian.

Dirk

-- 
Three out of two people have difficulties with fractions.