Launchpad has imported 27 comments from the remote bug at
https://bugzilla.kernel.org/show_bug.cgi?id=199779.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2018-05-21T07:03:07+00:00 ryan wrote:

Created attachment 276079
lspci -vv

On HPE DL360 Gen9 servers (and possibly other generations and/or products; I
haven't been able to test other HPE hardware right now, but I do have several
DL360 Gen9s on which I've confirmed this), the kernel crashes on
shutdown/reboot with:

[  122.447111] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[  122.447112] {1}[Hardware Error]: event severity: fatal
[  122.447113] {1}[Hardware Error]:  Error 0, type: fatal
[  122.447114] {1}[Hardware Error]:   section_type: PCIe error
[  122.447115] {1}[Hardware Error]:   port_type: 4, root port
[  122.447116] {1}[Hardware Error]:   version: 1.16
[  122.447118] {1}[Hardware Error]:   command: 0x6010, status: 0x0143
[  122.447119] {1}[Hardware Error]:   device_id: 0000:00:01.0
[  122.447119] {1}[Hardware Error]:   slot: 0
[  122.447120] {1}[Hardware Error]:   secondary_bus: 0x03
[  122.447120] {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2f02
[  122.447121] {1}[Hardware Error]:   class_code: 040600
[  122.447122] {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0003
[  122.447123] {1}[Hardware Error]:  Error 1, type: fatal
[  122.447123] {1}[Hardware Error]:   section_type: PCIe error
[  122.447124] {1}[Hardware Error]:   port_type: 4, root port
[  122.447125] {1}[Hardware Error]:   version: 1.16
[  122.447125] {1}[Hardware Error]:   command: 0x6010, status: 0x0143
[  122.447126] {1}[Hardware Error]:   device_id: 0000:00:01.0
[  122.447127] {1}[Hardware Error]:   slot: 0
[  122.447127] {1}[Hardware Error]:   secondary_bus: 0x03
[  122.447128] {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2f02
[  122.447129] {1}[Hardware Error]:   class_code: 040600
[  122.447130] {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0003
[  122.447131] Kernel panic - not syncing: Fatal hardware error!
[  122.447166] Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  122.459295] ERST: [Firmware Warn]: Firmware does not respond in time.

And after that, upon POST, the storage controller is not happy but does
eventually work:

Embedded RAID 1 : Smart Array P440ar Controller - (2048 MB, V6.30) 7 Logical Drive(s) - Operation Failed
 - 1719-Slot 0 Drive Array - A controller failure event occurred prior
   to this power-up.  (Previous lock up code = 0x13) Action: Install the
   latest controller firmware. If the problem persists, replace the
   controller.

The firmware is up to date (BIOS P89 01/22/2018, controller V6.30).
Interestingly, on older firmware (circa 2016, but I don't have an exact
version), this manifested as a crash loop:

[529151.035267] NMI: IOCK error (debug interrupt?) for reason 75 on CPU 0.
[529153.222883] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.222884] Do you have a strange power saving mode enabled?
[529153.222884] Dazed and confused, but trying to continue
[529153.554447] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.554448] Do you have a strange power saving mode enabled?
[529153.554449] Dazed and confused, but trying to continue

I've narrowed it down to https://patchwork.kernel.org/patch/10027157/, which
went in as part of commit 1b6115fbe3b3db746d7baa11399dd617fc75e1c4; removing
that line prevents the panic.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/4

------------------------------------------------------------------------
On 2018-05-21T11:11:51+00:00 okaya wrote:

Can you test this patch?

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/pci/hotplug?id=d22b362184553899f7d6b6760899a77d3b2d7c1b

There is a known Intel erratum that we missed.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/5

------------------------------------------------------------------------
On 2018-05-21T11:24:20+00:00 okaya wrote:

Can you also share your dmesg?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/6

------------------------------------------------------------------------
On 2018-05-21T19:31:17+00:00 ryan wrote:

Created attachment 276103
4.17.0-rc5-next-20180517 dmesg

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/7

------------------------------------------------------------------------
On 2018-05-21T19:32:26+00:00 ryan wrote:

Thanks, but it's the same problem with that patch against 4.15.  I even tried
next-20180517 to be sure; no luck.  The dmesg from next-20180517 is
attached.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/8

------------------------------------------------------------------------
On 2018-05-21T19:38:19+00:00 okaya wrote:

Cool, I had my suspicions; that's why I asked for the dmesg. Your system
doesn't seem to have the hotplug driver loaded. The bug fix above only
applies if the hotplug driver is enabled. Something else must be
happening.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/9

------------------------------------------------------------------------
On 2018-05-21T19:52:07+00:00 okaya wrote:

It looks like PME is the only PCIe port service driver loaded. Can you
empty out this line to see if it makes any difference? Then we can
start going deeper based on your test result.

https://elixir.bootlin.com/linux/latest/ident/pcie_pme_remove
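
For reference, one way to do this (and what the next reply does) is to drop
the .remove hook from the pcie_pme_driver descriptor in
drivers/pci/pcie/pme.c. A rough sketch of the v4.17-era struct (from memory;
exact fields and order may differ slightly):

static struct pcie_port_service_driver pcie_pme_driver = {
	.name		= "pcie_pme",
	/* ... port_type / service fields ... */
	.probe		= pcie_pme_probe,
	.suspend	= pcie_pme_suspend,
	.resume		= pcie_pme_resume,
	.remove		= pcie_pme_remove,	/* <-- the line removed in the next reply */
};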

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/10

------------------------------------------------------------------------
On 2018-05-21T21:05:30+00:00 ryan wrote:

Created attachment 276111
pcie_pme_remove removed, crash

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/11

------------------------------------------------------------------------
On 2018-05-21T21:06:16+00:00 ryan wrote:

-       .remove         = pcie_pme_remove,

With that removed, the crash becomes:

[  115.008578] kernel BUG at drivers/pci/msi.c:352!
[  115.069730] invalid opcode: 0000 [#1] SMP PTI
[  115.127399] CPU: 15 PID: 1 Comm: systemd-shutdow Not tainted 4.17.0-rc5-next-20180517-custom #1
[  115.242735] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 01/22/2018
[  115.351050] RIP: 0010:free_msi_irqs+0x17b/0x1b0
[  115.410250] Code: 84 e1 fe ff ff 45 31 f6 eb 11 41 83 c6 01 44 39 73 14 0f 86 ce fe ff ff 8b 7b 10 44 01 f7 e8 7c f4 bb ff 48 83 78 70 00 74 e0 <0f> 0b 49 8d b5 a0 00 00 00 e8 b7 a0 bc ff e9 cf fe ff ff 48 8b 78
[...]

Full output attached.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/12

------------------------------------------------------------------------
On 2018-05-21T21:08:06+00:00 okaya wrote:

Oops. Can you comment out this line only?

https://elixir.bootlin.com/linux/latest/source/drivers/pci/pcie/pme.c#L431

We have to call free_irq(); I went at the problem too aggressively.
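
Concretely, that test amounts to something like this in
drivers/pci/pcie/pme.c (a sketch assuming the v4.17-era pcie_pme_remove();
details such as the service-data kfree may differ):

static void pcie_pme_remove(struct pcie_device *srv)
{
	/* pcie_pme_suspend(srv); */	/* line 431: commented out for this test */
	free_irq(srv->irq, srv);	/* must still run; skipping it is what hit the msi.c:352 BUG above */
	kfree(get_service_data(srv));
}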

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/13

------------------------------------------------------------------------
On 2018-05-21T21:30:49+00:00 ryan wrote:

Commented out "pcie_pme_suspend(srv);", back to original Hardware Error
crash.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/14

------------------------------------------------------------------------
On 2018-05-21T21:34:55+00:00 okaya wrote:

Weird. I'll come up with a debug patch. In the meantime, can you collect
some more data on which other systems see this issue?

Since you are the first one to report the problem, there must be
something unique about your setup.

Also, please attach the output of sudo lspci -t.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/15

------------------------------------------------------------------------
On 2018-05-21T21:42:02+00:00 ryan wrote:

Sure.  I'm seeing this on a set of 4 DL360 Gen9s; I believe they were
all purchased at the same time, around 2016.  I'll look around for
further machines I can test on, looking for:

1) DL360 Gen9s but not from the same batch as these
2) Previous gens (not sure we have any older ones)
3) DL380 Gen9

Attaching lspci -t.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/16

------------------------------------------------------------------------
On 2018-05-21T21:42:24+00:00 ryan wrote:

Created attachment 276113
lspci -t

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/17

------------------------------------------------------------------------
On 2018-05-21T22:03:33+00:00 okaya wrote:

Created attachment 276115
debug_patch.patch

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/18

------------------------------------------------------------------------
On 2018-05-22T01:24:49+00:00 ryan wrote:

I was able to test on another DL360 Gen9, received about a year after the
ones I first discovered this on: same problem.  A DL380 Gen9 with similar
specs also crashes.  I was also able to test on a DL380 Gen10, which did
*not* crash.  In summary:

Bad: DL360 Gen9 - BIOS P89 v2.56 (01/22/2018) - P440ar V6.30 (originals)
Bad: DL360 Gen9 - BIOS P89 v2.52 (10/25/2017) - P440ar V6.06 (newer)
Bad: DL380 Gen9 - BIOS P89 v2.52 (10/25/2017) - P440ar V6.06 (newer)
Good: DL380 Gen10 - U30 v1.32 (02/01/2018) - P408i-a 1.04-0 (even newer)

Attached is the output from your debug patch on the original test
system.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/19

------------------------------------------------------------------------
On 2018-05-22T01:25:25+00:00 ryan wrote:

Created attachment 276121
debug patch output

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/20

------------------------------------------------------------------------
On 2018-05-22T01:43:38+00:00 okaya wrote:

Many thanks. Let's try these tests. The debug prints are not giving me any
clues; the error seems to be asynchronous to the code execution. We'll
have to find out by trial and error which change is confusing the HW. My
bet is on the first one, followed by the third. (A sketch of what test #1
touches follows the list.)

1. Comment out this line only.

https://elixir.bootlin.com/linux/v4.17-rc6/source/drivers/pci/pcie/portdrv_core.c#L412

2. Comment out this line only.

https://elixir.bootlin.com/linux/v4.17-rc6/source/drivers/pci/pcie/portdrv_pci.c#L148

3. Comment out the if block only.

https://elixir.bootlin.com/linux/v4.17-rc6/source/drivers/pci/pcie/portdrv_pci.c#L142
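
For reference, test #1 amounts to roughly the following in
drivers/pci/pcie/portdrv_core.c (a sketch assuming the v4.17-rc6 layout,
where line 412 is the pci_disable_device() call):

void pcie_port_device_remove(struct pci_dev *dev)
{
	device_for_each_child(&dev->dev, NULL, remove_iter);
	pci_free_irq_vectors(dev);
	/* pci_disable_device(dev); */	/* test #1: skip disabling the root port */
}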

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/22

------------------------------------------------------------------------
On 2018-05-22T02:24:18+00:00 ryan wrote:

Progress!  With #1, it reboots correctly.

A) I had reverted the debug print patch; do you want me to add it back?  Does
it give you any extra insight?
B) Should I move on to #2 and #3?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/23

------------------------------------------------------------------------
On 2018-05-22T02:34:51+00:00 okaya wrote:

No, this is enough. We now understand that disabling the bus master bit
in the root port's command register is what crashes your system.

I suspect that the firmware is talking to the PCIe bus in parallel, and
by clearing the bus master bit we are breaking the FW.
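
For context, pci_disable_device() ends up clearing the Bus Master bit in the
port's PCI command register, roughly like this (a sketch of the
do_pci_disable_device() helper in drivers/pci/pci.c as of ~v4.17; details may
differ):

static void do_pci_disable_device(struct pci_dev *dev)
{
	u16 pci_command;

	pci_read_config_word(dev, PCI_COMMAND, &pci_command);
	if (pci_command & PCI_COMMAND_MASTER) {
		pci_command &= ~PCI_COMMAND_MASTER;	/* stop the port from initiating/forwarding DMA */
		pci_write_config_word(dev, PCI_COMMAND, pci_command);
	}

	pcibios_disable_device(dev);
}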

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/24

------------------------------------------------------------------------
On 2018-05-22T02:39:28+00:00 okaya wrote:

Can you also attach the messages you see during shutdown/reboot?
The driver cleanup order could be important too.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/25

------------------------------------------------------------------------
On 2018-05-22T03:42:25+00:00 ryan wrote:

Created attachment 276123
shutdown log

[`dmesg -n debug` added, otherwise normal systemd-obfuscated user messages]

This particular test machine is a MAAS server: 4 interfaces, 2 bonds, 2
bridges. It normally runs a KVM instance directly, but I don't have it
set up to start automatically, to save time while testing.

Functionally, the other machines tested don't have a common operational
trait: an OpenStack "smoosh" (nova-compute + n-c-c + neutron + swift + ceph
etc. in LXDs), a straight Apache archive server, a standby firewall.
Actually, they all appear to be at least partially utilizing 10GbE
interfaces (hopefully that's not a consideration, since I'm not sure I
can pull a straight gigabit machine out of active use to test on short
notice).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/26

------------------------------------------------------------------------
On 2018-05-22T03:56:49+00:00 okaya wrote:

Can you apply debug_patch.patch, plus:

1. Comment out this line only.

https://elixir.bootlin.com/linux/v4.17-rc6/source/drivers/pci/pcie/portdrv_core.c#L412

and collect the shutdown log one more time?

I see quite a bit of driver shutdown activity from your network
adapters. I want to see it relative to the port service driver
shutdown, to see which happens first and which happens last.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/27

------------------------------------------------------------------------
On 2018-05-22T17:14:03+00:00 bhelgaas wrote:

I am not yet convinced that it is necessary for
pcie_port_device_remove() to call pci_disable_device() on PCIe Root
Ports and Switch Ports during a reboot.

A similar question came up during a discussion of pciehp timeouts during
shutdown [1].  Eric Biederman had a good response [2] that I haven't had
time to assimilate yet.

[1] https://lkml.kernel.org/r/8770820b-85a0-172b-7230-3a44524e6...@molgen.mpg.de
[2] https://lkml.kernel.org/r/87tvrrypkv....@xmission.com

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/28

------------------------------------------------------------------------
On 2018-05-22T17:33:13+00:00 okaya wrote:

I think the motivation is to keep rogue transactions from the devices from
hitting system memory while a new kernel is booted via kexec.

It is not an issue when an IOMMU is not present, since the second kernel
that is booting doesn't share the same address space.

However, when an IOMMU is present, an adapter can corrupt the newly booting
kernel. So you ideally want the bus master bit cleared for a clean
boot.

What is interesting is that kexec already does this job in
pci_device_shutdown(), so this extra clear is unnecessary. I'll post a
patch to remove it.
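
The existing kexec handling referred to here lives in pci_device_shutdown()
in drivers/pci/pci-driver.c; roughly, as of ~v4.17 (a sketch; details may
differ):

static void pci_device_shutdown(struct device *dev)
{
	struct pci_dev *pci_dev = to_pci_dev(dev);
	struct pci_driver *drv = pci_dev->driver;

	pm_runtime_resume(dev);

	if (drv && drv->shutdown)
		drv->shutdown(pci_dev);

	/*
	 * If this is a kexec reboot, turn off the Bus Master bit on the
	 * device to tell it not to continue doing DMA.
	 */
	if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
		pci_clear_master(pci_dev);
}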

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/29

------------------------------------------------------------------------
On 2018-06-11T18:28:26+00:00 okaya wrote:

The change has been merged for the 4.18 kernel:

https://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git/commit/?id=0d98ba8d70b0070ac117452ea0b663e26bbf46bf

This issue can be closed.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/31

------------------------------------------------------------------------
On 2018-06-19T18:20:40+00:00 ryan wrote:

Ack, thank you for all your help.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1771467/comments/34


** Changed in: linux
       Status: Unknown => Fix Released

** Changed in: linux
   Importance: Unknown => Medium

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1771467

Title:
  Reboot/shutdown kernel panic on HP DL360/DL380 Gen9 w/ bionic 4.15.0

To manage notifications about this bug go to:
https://bugs.launchpad.net/linux/+bug/1771467/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
