Last week I posted here that I was getting immediate segfaults when running MPI programs. The system logs showed that the segfaults were occurring in libibverbs.so, and the problem persisted even when I specified '-mca btl ^openib'.

Since then I've made a lot of progress on the problem: my jobs now run, but I get this message on standard error:

WARNING: There was an error initializing an OpenFabrics device.

  Local host:   greene021
  Local device: qib0

For the record, I'm using OpenMPI 4.0.3 running on CentOS 7.8, compiled with GCC 9.3.0.

While researching the immediate segfault issue, I came across this Red Hat Bug Report:

https://bugzilla.redhat.com/show_bug.cgi?id=1754099

According to that bug report, there was a regression in the version of UCX provided with CentOS 7.8 (UCX 1.5.2-1.el7), and downgrading to the UCX package that came with CentOS 7.7 (UCX 1.4.0-1.el7) resolved it. Suspecting this might be the cause of my problem, I did the same.

After the downgrade, my jobs still segfaulted, but at least I now got a backtrace showing that the segfault was happening in UCX.

At that point I suspected a bug in UCX itself, so I went to the UCX release page and installed the latest stable version (1.8.1) by building the SRPM provided there:

https://github.com/openucx/ucx/releases/tag/v1.8.1

After that, my application runs, but I get the error message above (repeated here):

WARNING: There was an error initializing an OpenFabrics device.

  Local host:   greene021
  Local device: qib0

Googling for that error message, I came across this OpenMPI bug discussion:

https://github.com/open-mpi/ompi/issues/6517

According to that discussion, rebuilding OpenMPI with the option '--without-verbs' should make the message go away. I tried that, but I still get the message. Here's the configure command line, taken from ompi_info:

Configure command line: '--prefix=/usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3' '--with-ucx' '--without-verbs' '--with-libfabric' '--with-libevent=/usr' '--with-libevent-libdir=/usr/lib64' '--with-pmix=/usr/pppl/pmix/3.1.5' '--with-pmi'
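
For anyone who wants to run the same check against their own build, here is a minimal sketch of confirming that the flags really made it into the installed OpenMPI. The helper name check_flags is hypothetical (it is not part of OpenMPI), and it assumes ompi_info prints each configure option in single quotes, as in the line above:

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): reads ompi_info-style text on
# stdin and reports whether each configure flag of interest appears in it.
# Only exact, single-quoted flags match, so '--with-ucx=/some/path' would be
# reported as missing by this simple check.
check_flags() {
    text=$(cat)
    for flag in --with-ucx --without-verbs; do
        case "$text" in
            *"'$flag'"*) printf '%s: present\n' "$flag" ;;
            *)           printf '%s: missing\n' "$flag" ;;
        esac
    done
}

# Typical use against a real installation:
#   ompi_info | grep 'Configure command line' | check_flags
```

This only shows what the build was configured with; it does not show which transport a running job actually selects.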

I have two questions:

1. How can I be sure that this message is really just a result of the old openib code (as stated in the OpenMPI bug discussion above), and my job is actually using InfiniBand with UCX?

2. If the message above is harmless, how can I make it go away so my users don't see it?

If you've made it this far, thanks for reading my whole message. Any help will be greatly appreciated!

--
Prentice
