One more bit of information: These are QLogic IB cards, not Mellanox:
$ lspci | grep QL
05:00.0 InfiniBand: QLogic Corp. IBA7322 QDR InfiniBand HCA (rev 02)
On 7/28/20 2:03 PM, Prentice Bisbal wrote:
Last week I posted on here that I was getting immediate segfaults when
I ran MPI programs, that the system logs showed the segfaults were
occurring in libibverbs.so, and that the problem still occurred
even if I specified '-mca btl ^openib'.
Since then, I've made a lot of progress on the problem, and my jobs
now run, but I'm getting this warning sent to standard error:
WARNING: There was an error initializing an OpenFabrics device.
Local host: greene021
Local device: qib0
For the record, I'm using OpenMPI 4.0.3 running on CentOS 7.8,
compiled with GCC 9.3.0.
While researching the immediate segfault issue, I came across this Red
Hat Bug Report:
https://bugzilla.redhat.com/show_bug.cgi?id=1754099
According to that bug report, there was a regression in the version of
UCX that was provided with CentOS 7.8 (UCX 1.5.2-1.el7), and
downgrading to the UCX package that came with CentOS 7.7 (UCX
1.4.0-1.el7) works around it. Suspecting this might be the cause of my
problem, I did the same.
After the downgrade, my jobs still segfaulted, but at least I now got
a backtrace showing that the segfault was happening in UCX.
Now I suspected a bug in UCX, so I went to the UCX website and
installed the latest stable version (1.8.1) by building the SRPM
provided by the UCX website:
https://github.com/openucx/ucx/releases/tag/v1.8.1
After that, my application runs, but I get the error message above
(repeated here):
WARNING: There was an error initializing an OpenFabrics device.
Local host: greene021
Local device: qib0
Googling for that error message, I came across this OpenMPI bug
discussion:
https://github.com/open-mpi/ompi/issues/6517
According to this, if I rebuild OpenMPI with the option
'--without-verbs', that message will go away. I tried that, but I am
still getting the error message. Here's the configure command line,
taken from ompi_info:
Configure command line:
'--prefix=/usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3' '--with-ucx'
'--without-verbs' '--with-libfabric' '--with-libevent=/usr'
'--with-libevent-libdir=/usr/lib64' '--with-pmix=/usr/pppl/pmix/3.1.5'
'--with-pmi'
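
Since the configure line alone does not show which components actually
ended up in the install, here is a minimal sketch of how that could be
checked. It assumes the human-readable ompi_info output lists each
component on a line of the form "MCA <framework>: <component> (...)".
The helper name has_component is hypothetical, and the live check only
runs if ompi_info is on the PATH.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): succeed if the given
# framework/component pair appears in ompi_info-style text on stdin.
# Assumes lines such as: "MCA btl: openib (MCA v2.1.0, ...)".
has_component() {
    framework=$1
    component=$2
    grep -Eq "MCA[[:space:]]+${framework}:[[:space:]]+${component}([[:space:]]|\$)"
}

# Live check, skipped when no Open MPI installation is on the PATH.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_component btl openib; then
        echo "openib BTL component is present in this install"
    else
        echo "openib BTL component not found"
    fi
    if ompi_info | has_component pml ucx; then
        echo "UCX PML component is present in this install"
    else
        echo "UCX PML component not found"
    fi
fi
```

If the openib BTL still shows up in a build configured with
'--without-verbs', one possible explanation (an assumption, not
something I have confirmed) is a leftover component from an earlier
install into the same prefix.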
I have two questions:
1. How can I be sure that this message is really just a result of the
old openib code (as stated in the OpenMPI bug discussion above), and
my job is actually using InfiniBand with UCX?
2. If the message above is harmless, how can I make it go away so my
users don't see it?
If you've made it this far, thanks for reading my whole message. Any
help will be greatly appreciated!
--
Prentice
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov