Public bug reported:

On Intel Xeon processors of the E5 v4 family or newer, calling rmdir() on
a resctrl monitor-only group causes a segmentation fault in the kernel.
After the segfault many operations hang, including the bug report
command `ubuntu-bug linux`. Even the `reboot` command hangs, and a
hardware reset is required to restore the normal state.

Reproduction steps:

1. Confirm that we're on the latest HWE kernel for 16.04
(4.15.0-96-generic for now)

```
$ uname -a
Linux <hostname> 4.15.0-96-generic #97~16.04.1-Ubuntu SMP Wed Apr 1 03:03:31 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
```

2. Confirm that we have an Intel RDT Memory Bandwidth Monitoring capable
CPU (mine is E5-2690 v4)

```
$ lscpu
...
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
...
```
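The model name alone does not show whether monitoring is available, so the CPU flags can be checked as well. This is a minimal sketch, assuming the kernel lists the `cqm_mbm_total` and `cqm_mbm_local` flags in `/proc/cpuinfo` on MBM-capable CPUs; the function name `has_mbm` and its optional file argument are only illustrative:

```shell
# has_mbm [cpuinfo-file]
# Succeeds if the first "flags" line lists an MBM feature flag.
# The file argument defaults to /proc/cpuinfo; it exists only so the
# check can be tried against a saved copy (an assumption of this sketch).
has_mbm() {
    cpuinfo="${1:-/proc/cpuinfo}"
    grep -m1 '^flags' "$cpuinfo" | grep -qwE 'cqm_mbm_total|cqm_mbm_local'
}
```

It could be used as `has_mbm && echo "MBM flags present"`.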

3. Execute the following commands as root to create a resctrl monitor
group

```
# mount -t resctrl resctrl /sys/fs/resctrl
# mkdir /sys/fs/resctrl/mon_groups/test
# ls /sys/fs/resctrl/mon_groups/test
cpus cpus_list mon_data tasks
```

We can see that the monitor group is created normally.
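The same check can be scripted rather than read off the `ls` output. This is a rough sketch that only tests for the four entries shown in the listing; the function name `check_mon_group` is made up for illustration:

```shell
# check_mon_group <dir>
# Succeeds if <dir> contains the entries seen in the ls output:
# cpus, cpus_list, mon_data and tasks. Reports the first missing one.
check_mon_group() {
    dir="$1"
    for entry in cpus cpus_list mon_data tasks; do
        [ -e "$dir/$entry" ] || { echo "missing: $entry" >&2; return 1; }
    done
}
```

For the group created in this step it would be called as `check_mon_group /sys/fs/resctrl/mon_groups/test`.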

4. Remove the newly created monitor group; a segfault occurs

```
# rmdir /sys/fs/resctrl/mon_groups/test
Segmentation fault
```

Guesses:

I believe that there is a bug in the Bionic kernel's upstream stable
patchset 2020-02-26
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1864904). The
corresponding commit is `c9c5f0ce9900a99433bb44e88ccc89665be15a07`
("x86/resctrl: Fix use-after-free due to inaccurate refcount of
rdtgroup"), which puts the free function `free_all_child_rdtgrp()` in
the wrong place.

The commit above fixes a race condition when removing a resctrl control
group. The commit message says `Fix it by moving free_all_child_rdtgrp() to
after kernfs_remove() in rdtgroup_rmdir_ctrl() to ensure it has the
accurate refcount of rdtgrp`, but the commit actually moves
`free_all_child_rdtgrp()` to a different function,
`rdtgroup_rmdir_mon()`. The "backporting notes" section in the commit
message is also confusing: it points out that the function modified in
upstream commit `fa7d949337cc` ("x86/resctrl: Rename and move rdt files
to a separate directory") is related to control groups, but for the
stable trees it mentions a function related to monitor groups.

Since I'm using the latest HWE kernel for 16.04, which backports Bionic's
kernel patches, I encountered this issue on 16.04.

Fixes and test results:

I moved `free_all_child_rdtgrp()` back to the original function
`rdtgroup_rmdir_ctrl()`, right after `kernfs_remove()` according to the
original commit message, compiled it and booted into the modified
kernel. It turns out that the segfault no longer happens.
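The manual test can be wrapped in a small helper for re-checking other kernel builds. This is only a sketch: the function name `mon_group_roundtrip` and its arguments are hypothetical, and it does nothing beyond the `mkdir`/`rmdir` pair from the reproduction steps. On an affected kernel the `rmdir` step is expected to crash and may hang the machine, so it should only be run on a host that can be hard-reset.

```shell
# mon_group_roundtrip <resctrl-root> [name]
# Creates a monitor group under <resctrl-root>/mon_groups and removes it.
# WARNING: on an affected kernel the rmdir step may segfault and leave
# the host hung; run it only where a hardware reset is acceptable.
mon_group_roundtrip() {
    root="$1"
    name="${2:-test}"
    group="$root/mon_groups/$name"
    mkdir "$group" || return 1
    if rmdir "$group" && [ ! -e "$group" ]; then
        echo "PASS"
    else
        echo "FAIL"
        return 1
    fi
}
```

As root, with resctrl mounted as in step 3, it would be invoked as `mon_group_roundtrip /sys/fs/resctrl test`.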

I created a patch based on the Bionic kernel's master branch. I have no
knowledge of the x86 architecture, so I'm not sure whether this is the
correct way to fix the issue. Hopefully someone can review it, and
I will try to submit a kernel patch (I have no prior experience with
this... sorry about that). Thanks!

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.15.0-96-generic 4.15.0-96.97~16.04.1
ProcVersionSignature: Ubuntu 4.15.0-96.97~16.04.1-generic 4.15.18
Uname: Linux 4.15.0-96-generic x86_64
NonfreeKernelModules: nvidia_uvm nvidia_drm nvidia_modeset nvidia
ApportVersion: 2.20.1-0ubuntu2.23
Architecture: amd64
Date: Thu Apr 16 11:01:58 2020
InstallationDate: Installed on 2018-10-30 (533 days ago)
InstallationMedia: Ubuntu 16.04.4 LTS "Xenial Xerus" - Release amd64 (20180228)
SourcePackage: linux-signed-hwe
UpgradeStatus: No upgrade log present (probably fresh install)

** Affects: linux-signed-hwe (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: amd64 apport-bug intel-rdt kernel-bug xenial

** Patch added: "0001-move-free_all_child_rdtgrp-back-to-rdtgroup_rmdir_ct.patch"
   https://bugs.launchpad.net/bugs/1873126/+attachment/5355027/+files/0001-move-free_all_child_rdtgrp-back-to-rdtgroup_rmdir_ct.patch

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1873126

Title:
  Calling rmdir() on a resctrl monitor group results in segmentation
  fault and hangs the system
