Hello,

Let me share our experience with MOFED/RHEL8/xCAT:

Actually, we don't use xCAT to install MOFED. We install it afterwards with Ansible, either in the genimage-generated chroot or on stateful nodes initially installed by xCAT.

Note that MOFED doesn't support only kernel point releases. It may also support an errata kernel out of the box, with no need to rebuild, thanks to the KMP/weak-modules mechanism. The keyword here is MAY. As a matter of fact, as they state themselves, it MAY also break:

Quoting their Overview:

"MLNX_OFED package for RedHat comes with RPMs that support KMP (weak-modules), meaning that when a new errata kernel is installed, compatibility links will be created under the weak-updates directory for the new kernel. Those links allow using the existing MLNX_OFED kernel modules without the need for recompilation. However, at times, the ABI of the new kernel may not be compatible with the MLNX_OFED modules, which will prevent loading them. In this case, the MLNX_OFED modules must be rebuilt against the new kernel."

We experienced both cases indeed.
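To see which case a given node is in, it can help to look at what weak-modules actually linked for the new kernel. Below is a minimal Python sketch, not anything shipped by MOFED: the function name weak_update_links and the default /lib/modules root are my own choices, and it assumes the compatibility links are plain symlinks under the kernel's weak-updates directory, as the overview describes.

```python
import os
from pathlib import Path


def weak_update_links(kernel_version, modules_root="/lib/modules"):
    """Map each symlink under <modules_root>/<kernel_version>/weak-updates
    to the target it points at (hypothetical helper, for illustration only).

    An empty result means no compatibility links exist for that kernel.
    """
    weak_dir = Path(modules_root) / kernel_version / "weak-updates"
    links = {}
    if not weak_dir.is_dir():
        return links
    for dirpath, _dirnames, filenames in os.walk(weak_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                # Key is the path relative to weak-updates, value is the raw link target.
                links[os.path.relpath(path, weak_dir)] = os.readlink(path)
    return links
```

Calling weak_update_links(os.uname().release) on a node after a kernel upgrade and getting an empty dict would suggest that no MOFED modules were carried over to the running kernel.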

Note: their initial RPMs come with KMP enabled. I think you have to use --kmp if you want to carry it over to your custom builds.

Their distribution provides kmp_compat.sh and uses it when a rebuild is requested. Through depmod, it can detect a kernel incompatibility (and then not install the weak modules).

However, a dnf upgrade of the kernel may still result in a broken system (if the ABI was broken), which is why MOFED is a pain in the ass.

An actual solution, which we mentioned to them several times but which I don't think they plan to implement, would be either:

a) some DKMS-like mechanism (as for the NVIDIA drivers - same company!)
b) at least some yum plugin preventing an upgrade to an incompatible kernel

For the stateless case, we are thinking about making our Ansible playbook fail when the image has neither an extra nor a weak-updates directory for its kernel, which is a sign of broken compatibility. That way, at least we won't boot a broken image. The stateful case would still have to be handled as well.
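As a rough illustration of that check (we have not settled on an implementation), here is a minimal Python sketch that a playbook task could call against the chroot. The function name, the list of module suffixes, and the stricter rule of requiring at least one module file rather than just the directory are assumptions of mine, not anything MOFED or xCAT provides.

```python
from pathlib import Path

# Assumed module file suffixes (plain and compressed kernel modules).
MODULE_SUFFIXES = (".ko", ".ko.xz", ".ko.gz", ".ko.zst")


def has_out_of_tree_modules(image_root, kernel_version):
    """Return True if extra/ or weak-updates/ under
    <image_root>/lib/modules/<kernel_version> holds at least one module.

    Hypothetical helper: False is taken as a sign that MOFED modules
    are not available for this kernel in the image.
    """
    base = Path(image_root) / "lib" / "modules" / kernel_version
    for sub in ("extra", "weak-updates"):
        directory = base / sub
        if not directory.is_dir():
            continue
        for entry in directory.rglob("*"):
            # Only the name is checked, so weak-updates symlinks count as well.
            if entry.name.endswith(MODULE_SUFFIXES):
                return True
    return False
```

A playbook could run a small wrapper that exits non-zero when has_out_of_tree_modules(rootimg_path, kernel_version) returns False, which would stop the image build before packimage. Both arguments are placeholders to be filled in from your own osimage definition.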

Note that we also ran into buggy RPM building when rebuilding the MOFED RPMs with the add_kernel_support option on an older (5.1) MOFED package: it introduced a dependency loop, with rdma-core being obsoleted by the mlnx-ofed-all and mlnx-ofed-all-useronly packages but also needed by them. Their fix was simply to replace the create_mlnx_ofed_installers.pl and mlnx_add_kernel_support.sh scripts with the ones from MOFED 5.5.

Finally, we install the following packages (with Ansible, but they could go in your pkglist):

  - mlnx-ofed-basic
  - ibutils2
  - qperf

Hope this helps

--
Thomas HUMMEL


_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user