Hello,
let me share our experience with MOFED/RHEL8/xCAT.
Actually, we don't use xCAT to install MOFED; we install it afterwards
with ansible, either in the genimage-generated chroot or on stateful
nodes initially installed by xCAT.
Note that MOFED doesn't only support kernel point releases. It may, out
of the box and with no need to rebuild, support errata kernels thanks to
the KMP/weak-modules mechanism. The keyword here is MAY. As a matter of
fact, as they state themselves, it MAY also break:
Quoting their Overview:
"MLNX_OFED package for RedHat comes with RPMs that support KMP
(weak-modules), meaning that when a new errata kernel is installed,
compatibility links will be created under the weak-updates directory for
the new kernel. Those links allow using the existing MLNX_OFED kernel
modules without the need for recompilation. However, at times, the ABI
of the new kernel may not be compatible with the MLNX_OFED modules,
which will prevent loading them. In this case, the MLNX_OFED modules
must be rebuilt against the new kernel."
We have indeed experienced both cases.
Note: their initial RPMs come with KMP enabled. I think you have to use
--kmp if you want to carry it over to your custom builds.
Their distribution provides kmp_compat.sh, and uses it when a rebuild is
requested. Through depmod, it can detect a kernel incompatibility (and
then not install the weak-modules links).
However, a dnf upgrade of the kernel may still result in a broken system
(if the ABI was broken), which is why MOFED is a pain in the ass.
The actual solution, which we mentioned to them several times but which
I don't think they plan to implement, would be either:
a) some dkms-like mechanism (as for the nvidia drivers - same company!)
b) at least some yum plugin preventing upgrades to a non-compatible kernel
We are thinking about something like, in the stateless case, making our
ansible playbook fail when there is neither an extra nor a weak-updates
directory, which is a sign of broken compatibility. So at least we won't
boot into a broken image. The stateful case would still have to be
handled as well.
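To make the idea concrete, here is a rough sketch of such a check in
Python (we have not implemented this; it is only an illustration). It
assumes the directories in question are lib/modules/<kernel>/extra and
lib/modules/<kernel>/weak-updates under the image root, and that "no
directory" can be widened to "no module files in either". The function
names, the image_root argument and the .ko/.ko.xz suffix filter are all
hypothetical choices for the example.

```python
import os
from pathlib import Path

# Assumption: module files end in .ko or .ko.xz (adjust for your build).
MODULE_SUFFIXES = (".ko", ".ko.xz")


def count_modules(image_root, kernel_version):
    """Count module files under extra/ and weak-updates/ for one kernel.

    image_root is the root of the tree to inspect ("/" on a stateful
    node, or the chroot directory of a stateless image).
    """
    base = Path(image_root) / "lib" / "modules" / kernel_version
    counts = {}
    for name in ("extra", "weak-updates"):
        directory = base / name
        found = 0
        if directory.is_dir():
            # weak-updates entries are symlinks; os.walk lists symlinks
            # to files as plain file names, so they are counted too.
            # We only count names and do not check that links resolve.
            for _dirpath, _dirnames, filenames in os.walk(directory):
                found += sum(1 for f in filenames if f.endswith(MODULE_SUFFIXES))
        counts[name] = found
    return counts


def image_looks_broken(image_root, kernel_version):
    """True when neither extra/ nor weak-updates/ holds any module."""
    return not any(count_modules(image_root, kernel_version).values())


def check(image_root, kernel_version):
    """Return a shell-style exit code: 1 if the image looks broken."""
    return 1 if image_looks_broken(image_root, kernel_version) else 0
```

A small wrapper calling sys.exit(check(root, kver)) could then be run
from the playbook so that the play stops on a non-zero return code. Note
that this does not prove the MOFED modules actually load, and other
out-of-tree modules in those directories would hide the problem.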
Note that we also ran into buggy RPM building when rebuilding the MOFED
RPMs with the add_kernel_support option with an older (5.1) MOFED
package: it introduced a dependency loop, with rdma-core being obsoleted
by the mlnx-ofed-all and mlnx-ofed-all-useronly packages but still
needed by them.
They fixed this by simply replacing the create_mlnx_ofed_installers.pl
and mlnx_add_kernel_support.sh scripts with the ones from MOFED 5.5.
Finally, we install the following packages (with ansible, but they
could also go in your pkglist):
- mlnx-ofed-basic
- ibutils2
- qperf
Hope this helps
--
Thomas HUMMEL