Last week I posted here that I was getting immediate segfaults when I
ran MPI programs, that the system logs showed the segfaults were
occurring in libibverbs.so, and that the problem still occurred even
when I specified '-mca btl ^openib'.
Since then, I've made a lot of progress on the problem, and now my jobs
run, but I'm now getting this error sent to standard error:
WARNING: There was an error initializing an OpenFabrics device.
Local host: greene021
Local device: qib0
For the record, I'm using OpenMPI 4.0.3 running on CentOS 7.8, compiled
with GCC 9.3.0.
While researching the immediate segfault issue, I came across this Red
Hat Bug Report:
https://bugzilla.redhat.com/show_bug.cgi?id=1754099
According to that bug report, there was a regression in the version of
UCX that was provided with CentOS 7.8 (UCX 1.5.2-1.el7), and downgrading
to the UCX package that came with CentOS 7.7 (UCX 1.4.0-1.el7) was the
workaround.
Suspecting this might be the cause of my problem, I did the same.
After the downgrade, my jobs still segfaulted, but at least I now got a
backtrace showing that the segfault was happening in UCX.
Now I suspected a bug in UCX, so I went to the UCX website and installed
the latest stable version (1.8.1) by building the SRPM provided by the
UCX website:
https://github.com/openucx/ucx/releases/tag/v1.8.1
After that, my application runs, but I get the error message above
(repeated here):
WARNING: There was an error initializing an OpenFabrics device.
Local host: greene021
Local device: qib0
Googling for that error message, I came across this OpenMPI bug discussion:
https://github.com/open-mpi/ompi/issues/6517
According to this, if I rebuild OpenMPI with the option
'--without-verbs', that message will go away. I tried that, but I am
still getting the error message. Here's the configure command-line,
taken from ompi_info:
Configure command line: '--prefix=/usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3'
'--with-ucx' '--without-verbs' '--with-libfabric' '--with-libevent=/usr'
'--with-libevent-libdir=/usr/lib64' '--with-pmix=/usr/pppl/pmix/3.1.5'
'--with-pmi'
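
One way to check whether the rebuild actually dropped the openib
component would be to look for it in the ompi_info component list. The
following is only a rough sketch: has_openib_btl is a made-up helper
name, and it assumes ompi_info lists components on lines of the form
"MCA btl: openib (...)".

```shell
#!/bin/sh
# Sketch: report whether an Open MPI build still lists the openib BTL.
# has_openib_btl is a hypothetical helper name, not part of Open MPI.
# It reads `ompi_info` output on stdin and succeeds if a line containing
# "MCA btl: openib" is present (assumed component-list format).
has_openib_btl() {
    grep -q 'MCA btl: openib'
}

# Only query the real installation when ompi_info is on the PATH.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_openib_btl; then
        echo "openib BTL is still present in this build"
    else
        echo "openib BTL is not listed in this build"
    fi
fi
```

If the component is still listed, the ompi_info being run may belong to
a different installation than the one that was just rebuilt.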
I have two questions:
1. How can I be sure that this message is really just a result of the
old openib code (as stated in the OpenMPI bug discussion above), and that
my job is actually using InfiniBand with UCX?
2. If the message above is harmless, how can I make it go away so my
users don't see it?
If you've made it this far, thanks for reading my whole message. Any
help will be greatly appreciated!
--
Prentice