Public bug reported:

SRU Justification

[Impact]

The 5.4.0-1075-azure and newer kernels can easily panic when the
Mellanox VF NIC is removed and re-added. This happens during Azure host
servicing events, and it can be reproduced with the manual "unbind/bind"
test below (the GUID can differ between VMs):

for i in `seq 1 1000`;
do
    cd /sys/bus/vmbus/drivers/hv_pci;
    echo abdc2107-402e-4704-8c88-c2b850696c3c > unbind;
    echo abdc2107-402e-4704-8c88-c2b850696c3c > bind;
done
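
The loop above can be wrapped in a small helper so that the GUID, the
iteration count and the driver directory are parameters. This is only a
sketch: the function name rebind_vf and its argument order are
hypothetical and not part of the original report; the default directory
and the unbind/bind writes are taken from the loop above.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): repeat the
# unbind/bind cycle from the manual test for one vmbus device.
# Usage: rebind_vf <device-guid> [iterations] [driver-dir]
rebind_vf() {
    guid=$1
    count=${2:-1000}
    dir=${3:-/sys/bus/vmbus/drivers/hv_pci}

    [ -n "$guid" ] || { echo "usage: rebind_vf <guid> [count] [dir]" >&2; return 2; }
    [ -d "$dir" ] || { echo "no such driver directory: $dir" >&2; return 1; }

    i=0
    while [ "$i" -lt "$count" ]; do
        # Detach the device from the driver, then attach it again.
        echo "$guid" > "$dir/unbind" || return 1
        echo "$guid" > "$dir/bind" || return 1
        i=$((i + 1))
    done
}
```

On an affected VM this would be run as root with the GUID of the local
VF device, for example: rebind_vf abdc2107-402e-4704-8c88-c2b850696c3c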

A sample panic call-trace is:
[ 107.359954] kernel BUG at /build/linux-azure-5.4-4I3kFs/linux-azure-5.4-5.4.0/mm/slub.c:4020!
[ 107.363858] invalid opcode: 0000 [#1] SMP NOPTI
[ 107.365870] CPU: 0 PID: 334 Comm: kworker/0:2 Not tainted 5.4.0-1077-azure #80~18.04.1-Ubuntu
[ 107.369589] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
[ 107.373811] Workqueue: events vmbus_onmessage_work
[ 107.375909] RIP: 0010:kfree+0x1d2/0x240
…
[ 107.413789] Call Trace:
[ 107.414867] kobject_uevent_env+0x1b5/0x7e0
[ 107.416747] kobject_uevent+0xb/0x10
[ 107.418327] device_release_driver_internal+0x191/0x1c0
[ 107.420653] device_release_driver+0x12/0x20
[ 107.422523] bus_remove_device+0xe1/0x150
[ 107.424279] device_del+0x167/0x380
[ 107.425824] device_unregister+0x1a/0x60
[ 107.427536] vmbus_device_unregister+0x27/0x50
[ 107.429528] vmbus_onoffer_rescind+0x1d0/0x1f0
[ 107.431474] vmbus_onmessage+0x2c/0x70
[ 107.433104] vmbus_onmessage_work+0x22/0x30
[ 107.434919] process_one_work+0x209/0x400
[ 107.436661] worker_thread+0x34/0x40

It turns out there is a bug in the Ubuntu commit
https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/+git/bionic/commit/?id=16a3c750a78d8,
which is missing the second hunk of the upstream patch
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=877b911a5ba0.

Please apply the below patch to fix the issue:

--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3653,7 +3653,7 @@ static int hv_pci_remove(struct hv_device *hdev)

        hv_put_dom_num(hbus->bridge->domain_nr);

-       free_page((unsigned long)hbus);
+       kfree(hbus);
        return ret;
 }
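
A quick way to see whether a given source tree still carries the stale
call is to search pci-hyperv.c for free_page() applied to hbus. The
helper below is a sketch under that assumption; the function name
hbus_fix_applied is hypothetical. It flags any free_page() on hbus in
the file, not only the one in hv_pci_remove().

```shell
#!/bin/sh
# Hypothetical helper (not from the original report).
# Usage: hbus_fix_applied <path-to-pci-hyperv.c>
# Returns 0 when no free_page(...hbus) call remains in the file,
# 1 when at least one remains, 2 when the file cannot be read.
hbus_fix_applied() {
    [ -r "$1" ] || { echo "cannot read: $1" >&2; return 2; }
    if grep -qE 'free_page\(.*hbus\)' "$1"; then
        return 1
    fi
    return 0
}
```

For example, from the top of a kernel tree:
hbus_fix_applied drivers/pci/controller/pci-hyperv.c && echo "fix present"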

Please also apply the following patch. It is not strictly required, because it
only touches an error-handling path that is rarely taken:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=42c3d41832ef4fcf60aaa6f748de01ad99572adf

[Test Case]

Tested by Microsoft.

** Affects: linux-azure (Ubuntu)
     Importance: Undecided
         Status: Invalid

** Affects: linux-azure (Ubuntu Focal)
     Importance: Medium
     Assignee: Tim Gardner (timg-tpi)
         Status: In Progress

** Package changed: linux (Ubuntu) => linux-azure (Ubuntu)

** Changed in: linux-azure (Ubuntu)
       Status: New => Invalid

** Also affects: linux-azure (Ubuntu Focal)
   Importance: Undecided
       Status: New

** Changed in: linux-azure (Ubuntu Focal)
   Importance: Undecided => Medium

** Changed in: linux-azure (Ubuntu Focal)
       Status: New => In Progress

** Changed in: linux-azure (Ubuntu Focal)
     Assignee: (unassigned) => Tim Gardner (timg-tpi)

https://bugs.launchpad.net/bugs/1973758

Title:
  Azure: Mellanox VF NIC crashes when removed

Status in linux-azure package in Ubuntu:
  Invalid
Status in linux-azure source package in Focal:
  In Progress

