Public bug reported:

On Intel Xeon processors newer than the E5 v4 family, calling rmdir() on a resctrl monitor-only group causes a segmentation fault in the kernel. After the segfault many operations hang, including the bug report command `ubuntu-bug linux`. Even the `reboot` command hangs, and a hardware reset is required to restore a normal state.
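Because a successful reproduction ends in a hardware reset, it may be worth checking first whether a machine even matches the reported setup. The following is only a sketch: the helper names `has_cpu_flag` and `resctrl_mounted` are made up for illustration, and it assumes the kernel exposes Memory Bandwidth Monitoring through the `cqm_mbm_total` flag in /proc/cpuinfo and lists a mounted resctrl filesystem in /proc/mounts.

```shell
# has_cpu_flag FLAG [CPUINFO]
# Succeeds if FLAG appears as a whole word on the first "flags" line.
# CPUINFO defaults to /proc/cpuinfo; the override exists only for testing.
has_cpu_flag() {
    grep -m1 '^flags' "${2:-/proc/cpuinfo}" | tr ' \t' '\n\n' | grep -qx "$1"
}

# resctrl_mounted [MOUNTS]
# Succeeds if any entry in MOUNTS (default /proc/mounts) has the
# filesystem type "resctrl" in its third field.
resctrl_mounted() {
    awk '$3 == "resctrl" { found = 1 } END { exit !found }' "${1:-/proc/mounts}"
}
```

For example, `has_cpu_flag cqm_mbm_total && echo "MBM capable"` and `resctrl_mounted || echo "resctrl not mounted"` can be run as an ordinary user. Neither helper touches /sys/fs/resctrl, so they do not trigger the crash.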
Reproduction steps:

1. Confirm that we're on the latest HWE kernel for 16.04 (4.15.0-96-generic for now)

```
$ uname -a
Linux <hostname> 4.15.0-96-generic #97~16.04.1-Ubuntu SMP Wed Apr 1 03:03:31 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
```

2. Confirm that we have an Intel RDT Memory Bandwidth Monitoring capable CPU (mine is E5-2690 v4)

```
$ lscpu
...
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
...
```

3. Execute the following commands as root to create a resctrl monitor group

```
# mount -t resctrl resctrl /sys/fs/resctrl
# mkdir /sys/fs/resctrl/mon_groups/test
# ls /sys/fs/resctrl/mon_groups/test
cpus cpus_list mon_data tasks
```

We can see that the monitor group is created normally.

4. Remove the newly created monitor group, and the segfault happens

```
# rmdir /sys/fs/resctrl/mon_groups/test
Segmentation fault
```

Guesses:

I believe that there is a bug in the Bionic kernel's upstream stable patchset 2020-02-26 (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1864904). The corresponding commit is `c9c5f0ce9900a99433bb44e88ccc89665be15a07` ("x86/resctrl: Fix use-after-free due to inaccurate refcount of rdtgroup"), which puts the free function `free_all_child_rdtgrp()` in the wrong place.

The commit above fixes a race condition when removing a resctrl control group. The commit message says `Fix it by moving free_all_child_rdtgrp() to after kernfs_remove() in rdtgroup_rmdir_ctrl() to ensure it has the accurate refcount of rdtgrp`, but the commit actually moves `free_all_child_rdtgrp()` into a different function, `rdtgroup_rmdir_mon()`.

Additionally, the "backporting notes" section in the commit message is confusing. It points out that the function modified in upstream commit `fa7d949337cc` ("x86/resctrl: Rename and move rdt files to a separate directory") relates to control groups, but for the stable trees it mentions a function related to monitor groups.
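One way to check this guess against a given source tree is to list which functions actually call `free_all_child_rdtgrp()`. The helper below is a rough sketch, not a C parser: the name `callers_of` is hypothetical, and it assumes kernel coding style, where a definition starts in column 0 and the closing brace of a function sits alone in column 0.

```shell
# callers_of FUNC FILE
# Prints the name of every top-level function in FILE whose body
# contains a call to FUNC. Heuristic only; see the assumptions above.
callers_of() {
    awk -v target="$1" '
        # A column-0 line with "(" that does not end in ";" is treated
        # as the first line of a function definition.
        /^[A-Za-z_]/ && /\(/ && !/;[ \t]*$/ {
            head = $0
            sub(/\(.*/, "", head)        # drop everything from the first "("
            sub(/[ \t]+$/, "", head)     # drop trailing whitespace
            n = split(head, words, /[ \t*]+/)
            current = words[n]           # last word before "(" is the name
            next                         # never report the definition itself
        }
        # A closing brace in column 0 ends the current function.
        /^}/ { current = "" }
        # Inside a function body, report any line that calls the target.
        current != "" && index($0, target "(") { print current }
    ' "$2" | sort -u
}
```

A possible invocation is `callers_of free_all_child_rdtgrp arch/x86/kernel/cpu/intel_rdt_rdtgroup.c`. That path is my assumption for 4.15-based trees (before the rename in `fa7d949337cc`) and should be verified against the actual checkout. If the guess above is right, an affected tree would print `rdtgroup_rmdir_mon`, while a tree matching the commit message would print `rdtgroup_rmdir_ctrl`.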
Since I'm using the latest HWE kernel for 16.04, which backports Bionic's kernel patches, I encountered this issue on 16.04.

Fixes and test results:

I moved `free_all_child_rdtgrp()` back to the original function `rdtgroup_rmdir_ctrl()`, right after `kernfs_remove()` as the original commit message describes, then compiled the kernel and booted into it. With the modified kernel the segfault no longer happens. I created a patch based on the Bionic kernel's master branch.

I have no knowledge of the x86 architecture code, so I'm not sure whether this is the correct way to fix the issue. Hopefully someone can review it, and I will try to submit a kernel patch (I have no experience with this... sorry about that).

Thanks!

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.15.0-96-generic 4.15.0-96.97~16.04.1
ProcVersionSignature: Ubuntu 4.15.0-96.97~16.04.1-generic 4.15.18
Uname: Linux 4.15.0-96-generic x86_64
NonfreeKernelModules: nvidia_uvm nvidia_drm nvidia_modeset nvidia
ApportVersion: 2.20.1-0ubuntu2.23
Architecture: amd64
Date: Thu Apr 16 11:01:58 2020
InstallationDate: Installed on 2018-10-30 (533 days ago)
InstallationMedia: Ubuntu 16.04.4 LTS "Xenial Xerus" - Release amd64 (20180228)
SourcePackage: linux-signed-hwe
UpgradeStatus: No upgrade log present (probably fresh install)

** Affects: linux-signed-hwe (Ubuntu)
     Importance: Undecided
         Status: New

** Tags: amd64 apport-bug intel-rdt kernel-bug xenial

** Patch added: "0001-move-free_all_child_rdtgrp-back-to-rdtgroup_rmdir_ct.patch"
   https://bugs.launchpad.net/bugs/1873126/+attachment/5355027/+files/0001-move-free_all_child_rdtgrp-back-to-rdtgroup_rmdir_ct.patch

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1873126

Title:
  Calling rmdir() on a resctrl monitor group results in segmentation fault and hangs the system