Re: [OMPI users] Cannot suppress openib error message
On 25 October 2007 at 07:54, Jeff Squyres wrote:
| We will not dlopen libibverbs.so directly -- we will only dlopen the
| mca_btl_openib.so file. The dynamic linker will automatically open
| all of its dependencies. If those dependencies cannot be found /
| symbols cannot be resolved, the dynamic linker will fail the dlopen
| of libibverbs.
|
| Can you run "ldd mca_btl_openib.so" on your head node and your
| compute nodes? See if there's a difference in the output. I think
| this is the next step in this troubleshooting process...

Sure, good idea.

head and build machine:

    $ ldd /usr/lib/openmpi/mca_btl_openib.so
            linux-gate.so.1 => (0xe000)
            libibverbs.so.1 => /usr/lib/libibverbs.so.1 (0xb7f42000)
            libpthread.so.0 => /lib/libpthread.so.0 (0xb7f2b000)
            libmpi.so.0 => /usr/lib/libmpi.so.0 (0xb7ea6000)
            libopen-rte.so.0 => /usr/lib/libopen-rte.so.0 (0xb7e52000)
            libopen-pal.so.0 => /usr/lib/libopen-pal.so.0 (0xb7dfb000)
            libdl.so.2 => /lib/libdl.so.2 (0xb7df7000)
            libnsl.so.1 => /lib/libnsl.so.1 (0xb7de1000)
            libutil.so.1 => /lib/libutil.so.1 (0xb7ddd000)
            libm.so.6 => /lib/libm.so.6 (0xb7db7000)
            libc.so.6 => /lib/libc.so.6 (0xb7c8a000)
            /lib/ld-linux.so.2 (0x8000)

compute node:

    $ ldd /usr/lib/openmpi/mca_btl_openib.so
    /usr/lib/openmpi/mca_btl_openib.so: /usr/lib/libibverbs.so.1: version `IBVERBS_1.1' not found (required by /usr/lib/openmpi/mca_btl_openib.so)
            linux-gate.so.1 => (0xe000)
            libibverbs.so.1 => /usr/lib/libibverbs.so.1 (0xb7ee6000)
            libpthread.so.0 => /lib/tls/i686/cmov/libpthread.so.0 (0xb7ecf000)
            libmpi.so.0 => /usr/lib/libmpi.so.0 (0xb7e4a000)
            libopen-rte.so.0 => /usr/lib/libopen-rte.so.0 (0xb7df6000)
            libopen-pal.so.0 => /usr/lib/libopen-pal.so.0 (0xb7d9f000)
            libdl.so.2 => /lib/tls/i686/cmov/libdl.so.2 (0xb7d9b000)
            libnsl.so.1 => /lib/tls/i686/cmov/libnsl.so.1 (0xb7d84000)
            libutil.so.1 => /lib/tls/i686/cmov/libutil.so.1 (0xb7d8)
            libm.so.6 => /lib/tls/i686/cmov/libm.so.6 (0xb7d58000)
            libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7c17000)
            libsysfs.so.2 => /lib/libsysfs.so.2 (0xb7c0c000)
            /lib/ld-linux.so.2 (0x8000)

Bingo!! And I have been caught with an inconsistent package install. Tsk tsk.

I *think* this may be because at one point, before "we" (as in the few folks
looking after the .deb for Open MPI) had learned about the 'btl ^openib'
option, I had become so disenchanted with the 'noisy' message that I hacked
libibverbs. That may explain the head node. Let me get that one back to the
pristine Ubuntu / Debian package, and then possibly rebuild the Open MPI
package there to get the Depends right.

Thanks so much for your help and patience on this.

Dirk

--
Three out of two people have difficulties with fractions.
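
The head-node / compute-node comparison above can be automated when there are many nodes to check. What follows is a minimal sketch, not part of the original exchange: `ldd_problems` is a hypothetical helper name, and it only assumes that ldd marks an unresolved library or symbol version with the words "not found", as in the compute-node output above.

```python
def ldd_problems(ldd_output):
    """Return the lines of ldd output that report an unresolved dependency."""
    problems = []
    for line in ldd_output.splitlines():
        text = line.strip()
        # ldd prints "=> not found" for a missing library and
        # "version `X' not found (required by ...)" for a missing symbol version.
        if "not found" in text:
            problems.append(text)
    return problems


# Abbreviated copy of the compute-node output quoted in this message.
SAMPLE = """\
/usr/lib/openmpi/mca_btl_openib.so: /usr/lib/libibverbs.so.1: version `IBVERBS_1.1' not found (required by /usr/lib/openmpi/mca_btl_openib.so)
        libibverbs.so.1 => /usr/lib/libibverbs.so.1 (0xb7ee6000)
        libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7c17000)
"""

if __name__ == "__main__":
    for problem in ldd_problems(SAMPLE):
        print(problem)
```

Feeding it the captured `ldd` output from each node would flag only the nodes whose libibverbs lacks the symbol version the plugin was built against.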
Re: [OMPI users] Cannot suppress openib error message
On Oct 24, 2007, at 10:05 PM, Dirk Eddelbuettel wrote:

> | > | If I had to guess, the systems where you don't see the warning are
> | > | systems that have OFED loaded.
> | >
> | > I am pretty sure that none of the systems (at work) have IB hardware. I am
> | > very sure that my home systems do not, and there the 'btl = ^openib'
> | > successfully suppresses the warning --- whereas at work it doesn't.
> |
> | Note that you don't need to have IB hardware -- all you need is the
> | OFED software loaded. I don't know if Debian ships the OFED
> | libraries by default...? In particular, look for libibverbs:
> |
> | [18:28] svbu-mpi:~/svn/ompi % ldd $bogus/lib/openmpi/mca_btl_openib.so
> |         libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x002a956c2000)
> |         libnsl.so.1 => /lib64/libnsl.so.1 (0x002a957cd000)
> |         libutil.so.1 => /lib64/libutil.so.1 (0x002a958e4000)
> |         libm.so.6 => /lib64/tls/libm.so.6 (0x002a959e8000)
> |         libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x002a95b6e000)
> |         libc.so.6 => /lib64/tls/libc.so.6 (0x002a95c83000)
> |         libdl.so.2 => /lib64/libdl.so.2 (0x002a95eb8000)
> |         /lib64/ld-linux-x86-64.so.2 (0x00552000)
>
> Good point. However, I use the .deb packages which I build for Debian,
> and they use libibverbs where available:
>
>     Build-Depends: [...], libibverbs-dev [!kfreebsd-i386 !kfreebsd-amd64 \
>       !hurd-i386], gfortran, libsysfs-dev, automake, gcc (>= 4:4.1.2)
>
> in particular on i386. Consequently, the binary package ends up with a
> Depends on the run-time package 'libibverbs1' -- and this will hence always
> be present, as all my systems use the .deb packages (either from Debian or
> locally rebuilt) that force libibverbs1 in via this Depends.
>
> At work, I re-build these same packages under Ubuntu on my "head node". And
> on the head node, no warning is seen -- whereas my compute nodes issue the
> warning.
>
> Could this be another one of the dlopen issues where basically
> dlopen("libibverbs.so") is executed? Because the compute nodes do NOT have
> libibverbs.so (from the -dev package) but only libibverbs.so.1.0.0 and its
> matching symlink libibverbs.so.1.

We will not dlopen libibverbs.so directly -- we will only dlopen the
mca_btl_openib.so file. The dynamic linker will automatically open
all of its dependencies. If those dependencies cannot be found /
symbols cannot be resolved, the dynamic linker will fail the dlopen
of libibverbs.

Can you run "ldd mca_btl_openib.so" on your head node and your
compute nodes? See if there's a difference in the output. I think
this is the next step in this troubleshooting process...

> I just tested that hypothesis and installed libibverbs-dev, but no beans.
> Still get the warning.
>
> | However, I note something in your last reply that I may have missed
> | before -- can you clarify a point for me: are you saying that on your
> | home machine, this generates the openib "file not found" warning:
> |
> |     mpirun -np 2 hello
> |
> | but this does not:
> |
> |     mpirun -np 2 --mca btl ^openib hello
>
> More or less, but I use /etc/openmpi/openmpi-mca-params.conf to toggle
> ^openib. Adding it again as --mca btl ^openib changes nothing,
> unfortunately.

This MCA behavior is as expected; adding a param to openmpi-mca-params.conf
is exactly the same as putting it on the command line (except that the
command line has higher precedence).

--
Jeff Squyres
Cisco Systems
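
The precedence rule Jeff states at the end of this message can be modelled in a few lines. This is a simplified sketch and not Open MPI's actual parser: `parse_mca_conf` and `effective_params` are hypothetical names, the `key = value` syntax with `#` comments is inferred from the grep output quoted elsewhere in this thread, and `tcp,self` is only an illustrative value.

```python
def parse_mca_conf(text):
    """Parse 'key = value' lines, ignoring blank lines and '#' comments."""
    params = {}
    for line in text.splitlines():
        # Assumption: everything after '#' is a comment.
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def effective_params(conf_text, cli_params):
    """Merge file and command-line parameters; the command line wins."""
    merged = parse_mca_conf(conf_text)
    merged.update(cli_params)
    return merged


# The two lines that grep found in openmpi-mca-params.conf in this thread.
CONF = "# btl = ^openib\nbtl = ^openib\n"

if __name__ == "__main__":
    print(effective_params(CONF, {}))
    print(effective_params(CONF, {"btl": "tcp,self"}))
```

Under this model, repeating `--mca btl ^openib` on the command line when the file already sets it yields the same effective value, which matches Dirk's observation that adding it again changed nothing.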
Re: [OMPI users] Cannot suppress openib error message
On 24 October 2007 at 21:31, Jeff Squyres wrote:
| On Oct 24, 2007, at 9:23 PM, Dirk Eddelbuettel wrote:
|
| > | If I had to guess, the systems where you don't see the warning are
| > | systems that have OFED loaded.
| >
| > I am pretty sure that none of the systems (at work) have IB hardware. I am
| > very sure that my home systems do not, and there the 'btl = ^openib'
| > successfully suppresses the warning --- whereas at work it doesn't.
|
| Note that you don't need to have IB hardware -- all you need is the
| OFED software loaded. I don't know if Debian ships the OFED
| libraries by default...? In particular, look for libibverbs:
|
| [18:28] svbu-mpi:~/svn/ompi % ldd $bogus/lib/openmpi/mca_btl_openib.so
|         libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x002a956c2000)
|         libnsl.so.1 => /lib64/libnsl.so.1 (0x002a957cd000)
|         libutil.so.1 => /lib64/libutil.so.1 (0x002a958e4000)
|         libm.so.6 => /lib64/tls/libm.so.6 (0x002a959e8000)
|         libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x002a95b6e000)
|         libc.so.6 => /lib64/tls/libc.so.6 (0x002a95c83000)
|         libdl.so.2 => /lib64/libdl.so.2 (0x002a95eb8000)
|         /lib64/ld-linux-x86-64.so.2 (0x00552000)

Good point. However, I use the .deb packages which I build for Debian,
and they use libibverbs where available:

    Build-Depends: [...], libibverbs-dev [!kfreebsd-i386 !kfreebsd-amd64 \
      !hurd-i386], gfortran, libsysfs-dev, automake, gcc (>= 4:4.1.2)

in particular on i386. Consequently, the binary package ends up with a
Depends on the run-time package 'libibverbs1' -- and this will hence always
be present, as all my systems use the .deb packages (either from Debian or
locally rebuilt) that force libibverbs1 in via this Depends.

At work, I re-build these same packages under Ubuntu on my "head node". And
on the head node, no warning is seen -- whereas my compute nodes issue the
warning.

Could this be another one of the dlopen issues where basically
dlopen("libibverbs.so") is executed? Because the compute nodes do NOT have
libibverbs.so (from the -dev package) but only libibverbs.so.1.0.0 and its
matching symlink libibverbs.so.1.

I just tested that hypothesis and installed libibverbs-dev, but no beans.
Still get the warning.

| However, I note something in your last reply that I may have missed
| before -- can you clarify a point for me: are you saying that on your
| home machine, this generates the openib "file not found" warning:
|
|     mpirun -np 2 hello
|
| but this does not:
|
|     mpirun -np 2 --mca btl ^openib hello

More or less, but I use /etc/openmpi/openmpi-mca-params.conf to toggle
^openib. Adding it again as --mca btl ^openib changes nothing,
unfortunately.

| If so, can you confirm which version of Open MPI you are running?
| The only reason that I can think that that would happen is if you are
| running a trunk nightly download of Open MPI... If not, then there's
| something else going on that would be worth understanding.

No, plain 1.2.4 from the original tarballs.

Still puzzled. To recap, the head node and the compute node both use the
same Ubuntu release and the same binary .deb packages of Open MPI 1.2.4
that I rebuilt there. The 'sole' difference is that the 'head node' has more
development packages and tools installed -- but that should not matter.

I just re-checked and the compute node does not have any LAM or MPICH parts
remaining.

Dirk

--
Three out of two people have difficulties with fractions.
Re: [OMPI users] Cannot suppress openib error message
On Oct 24, 2007, at 9:23 PM, Dirk Eddelbuettel wrote:

> | If I had to guess, the systems where you don't see the warning are
> | systems that have OFED loaded.
>
> I am pretty sure that none of the systems (at work) have IB hardware. I am
> very sure that my home systems do not, and there the 'btl = ^openib'
> successfully suppresses the warning --- whereas at work it doesn't.

Note that you don't need to have IB hardware -- all you need is the
OFED software loaded. I don't know if Debian ships the OFED
libraries by default...? In particular, look for libibverbs:

[18:28] svbu-mpi:~/svn/ompi % ldd $bogus/lib/openmpi/mca_btl_openib.so
        libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x002a956c2000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x002a957cd000)
        libutil.so.1 => /lib64/libutil.so.1 (0x002a958e4000)
        libm.so.6 => /lib64/tls/libm.so.6 (0x002a959e8000)
        libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x002a95b6e000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x002a95c83000)
        libdl.so.2 => /lib64/libdl.so.2 (0x002a95eb8000)
        /lib64/ld-linux-x86-64.so.2 (0x00552000)

However, I note something in your last reply that I may have missed
before -- can you clarify a point for me: are you saying that on your
home machine, this generates the openib "file not found" warning:

    mpirun -np 2 hello

but this does not:

    mpirun -np 2 --mca btl ^openib hello

If so, can you confirm which version of Open MPI you are running?
The only reason that I can think that that would happen is if you are
running a trunk nightly download of Open MPI... If not, then there's
something else going on that would be worth understanding.

--
Jeff Squyres
Cisco Systems
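
Besides running ldd on the plugin, one can ask the dynamic loader directly whether libibverbs is visible and loadable on a given machine. The sketch below is an illustration only and not something proposed in the thread: `probe_library` is a hypothetical helper built on Python's standard `ctypes` module, and passing `"ibverbs"` assumes the library follows the usual `lib<name>.so` naming.

```python
import ctypes
import ctypes.util


def probe_library(short_name):
    """Return (ok, detail): whether the loader can find and load a library."""
    found = ctypes.util.find_library(short_name)
    if found is None:
        return (False, "not found by find_library")
    try:
        # CDLL performs a dlopen(); an unresolved dependency or a missing
        # symbol version surfaces here as OSError.
        ctypes.CDLL(found)
    except OSError as exc:
        return (False, str(exc))
    return (True, found)


if __name__ == "__main__":
    print(probe_library("ibverbs"))
```

A `(False, ...)` result on a compute node and `(True, ...)` on the head node would point at the same kind of library mismatch that the ldd comparison is meant to reveal.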
Re: [OMPI users] Cannot suppress openib error message
On 24 October 2007 at 16:22, Jeff Squyres wrote:
| On Oct 24, 2007, at 4:16 PM, Dirk Eddelbuettel wrote:
|
| > I buy that explanation any day, but what is funny is that the
| >     btl = ^openib
| > does suppress the warning on some of my systems (all running 1.2.4)
| > but not others (also running 1.2.4).
|
| If I had to guess, the systems where you don't see the warning are
| systems that have OFED loaded.

I am pretty sure that none of the systems (at work) have IB hardware. I am
very sure that my home systems do not, and there the 'btl = ^openib'
successfully suppresses the warning --- whereas at work it doesn't.

Must be a side effect of something else. I made sure no LAM libs were
left around.

Dirk

--
Three out of two people have difficulties with fractions.
Re: [OMPI users] Cannot suppress openib error message
On Oct 24, 2007, at 4:16 PM, Dirk Eddelbuettel wrote:

> I buy that explanation any day, but what is funny is that the
>     btl = ^openib
> does suppress the warning on some of my systems (all running 1.2.4)
> but not others (also running 1.2.4).

If I had to guess, the systems where you don't see the warning are
systems that have OFED loaded.

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] Cannot suppress openib error message
Hi Jeff,

On 24 October 2007 at 15:43, Jeff Squyres wrote:
| This is quite likely because of a "feature" in how the OMPI v1.2
| series handles its plugins. In OMPI <=v1.2.x, Open MPI opens all
| plugins that it can find and *then* applies the filter that you
| provide (e.g., via the "btl" MCA param) to close / ignore certain
| plugins.
|
| In OMPI >=v1.3, we [effectively] apply the filter *before* opening
| plugins. So "--mca btl ^openib" will actually prevent the openib BTL
| plugin from being loaded.
|
| I'm guessing that what you're seeing today is because we're opening
| the openib BTL on a system where the OpenFabrics support libraries
| are not available, and therefore the dlopen() fails. The error
| string that we get back from libltdl is the somewhat-misleading "file
| not found (ignored)", and that's what we print (note that ltdl is
| referring to the fact that a dependent library is not found).

I buy that explanation any day, but what is funny is that the
    btl = ^openib
does suppress the warning on some of my systems (all running 1.2.4) but not
others (also running 1.2.4).

Hm.

Dirk

| On Oct 24, 2007, at 9:51 AM, Dirk Eddelbuettel wrote:
|
| >
| > I've been scratching my head over this:
| >
| > lnx01:/usr/lib> orterun -n 2 --mca btl ^openib ~/c++/tests/mpitest
| > [lnx01:14417] mca: base: component_find: unable to open btl openib: file not found (ignored)
| > [lnx01:14418] mca: base: component_find: unable to open btl openib: file not found (ignored)
| > Hello world, I'm process 0
| > Hello world, I'm process 1
| > lnx01:/usr/lib> grep openib /etc/openmpi/openmpi-mca-params.conf
| > # btl = ^openib
| > btl = ^openib
| > lnx01:/usr/lib> orterun -n 2 ~/c++/tests/mpitest
| > [lnx01:14429] mca: base: component_find: unable to open btl openib: file not found (ignored)
| > [lnx01:14430] mca: base: component_find: unable to open btl openib: file not found (ignored)
| > Hello world, I'm process 0
| > Hello world, I'm process 1
| >
| > and when I strace it, I get
| >
| > uname({sys="Linux", node="lnx01", ...}) = 0
| > open("/etc/openmpi/openmpi-mca-params.conf", O_RDONLY) = 3
| > ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf820698) = -1 ENOTTY (Inappropriate ioctl for device)
| > fstat64(3, {st_mode=S_IFREG|0644, st_size=2877, ...}) = 0
| > mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f72000
| > read(3, "#\n# Copyright (c) 2004-2005 The "..., 8192) = 2877
| > read(3, "", 4096) = 0
| > read(3, "", 8192) = 0
| > ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf8205f8) = -1 ENOTTY (Inappropriate ioctl for device)
| > close(3) = 0
| > munmap(0xb7f72000, 4096) = 0
| >
| > Why can't I suppress the dreaded InfiniBand message?
| >
| > System is Ubuntu 7.04 with 'ported' (i.e. locally recompiled) current
| > Open MPI packages from Debian.
| >
| > Dirk
| >
| > --
| > Three out of two people have difficulties with fractions.
| > ___
| > users mailing list
| > us...@open-mpi.org
| > http://www.open-mpi.org/mailman/listinfo.cgi/users
|
|
| --
| Jeff Squyres
| Cisco Systems

--
Three out of two people have difficulties with fractions.
Re: [OMPI users] Cannot suppress openib error message
This is quite likely because of a "feature" in how the OMPI v1.2
series handles its plugins. In OMPI <=v1.2.x, Open MPI opens all
plugins that it can find and *then* applies the filter that you
provide (e.g., via the "btl" MCA param) to close / ignore certain
plugins.

In OMPI >=v1.3, we [effectively] apply the filter *before* opening
plugins. So "--mca btl ^openib" will actually prevent the openib BTL
plugin from being loaded.

I'm guessing that what you're seeing today is because we're opening
the openib BTL on a system where the OpenFabrics support libraries
are not available, and therefore the dlopen() fails. The error
string that we get back from libltdl is the somewhat-misleading "file
not found (ignored)", and that's what we print (note that ltdl is
referring to the fact that a dependent library is not found).

On Oct 24, 2007, at 9:51 AM, Dirk Eddelbuettel wrote:

> I've been scratching my head over this:
>
> lnx01:/usr/lib> orterun -n 2 --mca btl ^openib ~/c++/tests/mpitest
> [lnx01:14417] mca: base: component_find: unable to open btl openib: file not found (ignored)
> [lnx01:14418] mca: base: component_find: unable to open btl openib: file not found (ignored)
> Hello world, I'm process 0
> Hello world, I'm process 1
> lnx01:/usr/lib> grep openib /etc/openmpi/openmpi-mca-params.conf
> # btl = ^openib
> btl = ^openib
> lnx01:/usr/lib> orterun -n 2 ~/c++/tests/mpitest
> [lnx01:14429] mca: base: component_find: unable to open btl openib: file not found (ignored)
> [lnx01:14430] mca: base: component_find: unable to open btl openib: file not found (ignored)
> Hello world, I'm process 0
> Hello world, I'm process 1
>
> and when I strace it, I get
>
> uname({sys="Linux", node="lnx01", ...}) = 0
> open("/etc/openmpi/openmpi-mca-params.conf", O_RDONLY) = 3
> ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf820698) = -1 ENOTTY (Inappropriate ioctl for device)
> fstat64(3, {st_mode=S_IFREG|0644, st_size=2877, ...}) = 0
> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f72000
> read(3, "#\n# Copyright (c) 2004-2005 The "..., 8192) = 2877
> read(3, "", 4096) = 0
> read(3, "", 8192) = 0
> ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf8205f8) = -1 ENOTTY (Inappropriate ioctl for device)
> close(3) = 0
> munmap(0xb7f72000, 4096) = 0
>
> Why can't I suppress the dreaded InfiniBand message?
>
> System is Ubuntu 7.04 with 'ported' (i.e. locally recompiled) current
> Open MPI packages from Debian.
>
> Dirk
>
> --
> Three out of two people have difficulties with fractions.

--
Jeff Squyres
Cisco Systems
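
The difference between the two orderings described at the top of this message can be made concrete with a toy model. This is a sketch under stated assumptions, not Open MPI code: `load_components` and its parameters are hypothetical, `can_open` stands in for the dlopen() attempt, and the component names `self` and `tcp` are included only to give the filter something to keep.

```python
def load_components(available, excluded, can_open, filter_first):
    """Simulate component selection; return (loaded, warnings)."""
    loaded, warnings = [], []
    for name in available:
        if filter_first and name in excluded:
            continue  # v1.3-style: an excluded component is never opened
        if not can_open(name):
            # The open attempt itself is what produces the message.
            warnings.append(
                "unable to open btl %s: file not found (ignored)" % name)
            continue
        if name in excluded:
            continue  # v1.2-style: opened first, then closed / ignored
        loaded.append(name)
    return loaded, warnings


def can_open(name):
    # Stand-in for dlopen(): pretend openib's dependencies do not resolve.
    return name != "openib"


if __name__ == "__main__":
    names = ["self", "tcp", "openib"]
    print(load_components(names, {"openib"}, can_open, filter_first=False))
    print(load_components(names, {"openib"}, can_open, filter_first=True))
```

Both orderings end up with the same set of usable components; only the open-then-filter ordering attempts the failing open and therefore prints the warning even though openib was excluded.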
[OMPI users] Cannot suppress openib error message
I've been scratching my head over this:

lnx01:/usr/lib> orterun -n 2 --mca btl ^openib ~/c++/tests/mpitest
[lnx01:14417] mca: base: component_find: unable to open btl openib: file not found (ignored)
[lnx01:14418] mca: base: component_find: unable to open btl openib: file not found (ignored)
Hello world, I'm process 0
Hello world, I'm process 1
lnx01:/usr/lib> grep openib /etc/openmpi/openmpi-mca-params.conf
# btl = ^openib
btl = ^openib
lnx01:/usr/lib> orterun -n 2 ~/c++/tests/mpitest
[lnx01:14429] mca: base: component_find: unable to open btl openib: file not found (ignored)
[lnx01:14430] mca: base: component_find: unable to open btl openib: file not found (ignored)
Hello world, I'm process 0
Hello world, I'm process 1

and when I strace it, I get

uname({sys="Linux", node="lnx01", ...}) = 0
open("/etc/openmpi/openmpi-mca-params.conf", O_RDONLY) = 3
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf820698) = -1 ENOTTY (Inappropriate ioctl for device)
fstat64(3, {st_mode=S_IFREG|0644, st_size=2877, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f72000
read(3, "#\n# Copyright (c) 2004-2005 The "..., 8192) = 2877
read(3, "", 4096) = 0
read(3, "", 8192) = 0
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf8205f8) = -1 ENOTTY (Inappropriate ioctl for device)
close(3) = 0
munmap(0xb7f72000, 4096) = 0

Why can't I suppress the dreaded InfiniBand message?

System is Ubuntu 7.04 with 'ported' (i.e. locally recompiled) current
Open MPI packages from Debian.

Dirk

--
Three out of two people have difficulties with fractions.