[Bug 1681909] Comment bridged from LTC Bugzilla
--- Comment From mu...@br.ibm.com 2018-03-06 11:15 EDT---
(In reply to comment #45)
> Hi, Murilo.
>
> Can you test it on 16.04 using kdump-tools from xenial-proposed? Maybe the
> noirqdistrib option might be related to the EEH issues.
>

Ok, I'll give it a try.

(In reply to comment #46)
> Looking at the log, I noticed the EEH is frozen right after finding the
> Broadcom card. Is that one the tg3?
>
> [ OK ] Found device NetXtreme BCM5719 Gigabit Ethernet PCIe.
> [8.191135] EEH: Frozen PE#7 on PHB#21 detected
> [8.191280] EEH: PE location: S00210f, PHB location: N/A

Yeah, correct, this is the tg3 device. But the EEH is seen on a different PHB than the one the adapter is on. The adapter is on PHB#01, whereas the EEH is seen on PHB#21.

>
> Also, the recovery problem seems to be caused by ast.
>
> [ 18.267005] EEH: 210 reads ignored for recovering device at
> location=S00210f driver=ast pci addr=0021:10:00.0
> [ 18.267334] EEH: Might be infinite loop in ast driver
>
> Looking at the upstream logs, one commit came up. Can you open a new bug for
> it?
>
> commit 298360af3dab45659810fdc51aba0c9f4097e4f6
> Author: Russell Currey
> Date: Thu Dec 15 16:12:41 2016 +1100
>
> drivers/gpu/drm/ast: Fix infinite loop if read fails

Cascardo, the mentioned patch is already in this kernel. The changelog for linux-image-4.4.0-116-generic shows:

* Xenial update to v4.4.41 stable release (LP: #1655041)
- drivers/gpu/drm/ast: Fix infinite loop if read fails

Also, this is not the only device hitting the EEH: when I blacklisted the ast module, I still saw the EEH hitting the other slots behind the PLX switch.

I was able to collect full dmesg output by adding the dmesg command to the KDUMP_FAIL_CMD option, but still no luck getting it to drop to a shell.

-- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1681909 Title: dump is not captured in remote host when kdump over ssh is configured on firestone. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu-power-systems/+bug/1681909/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1681909] Comment bridged from LTC Bugzilla
--- Comment From mu...@br.ibm.com 2018-03-06 09:38 EDT---
Cascardo, I gave kdump a try on Ubuntu 16.04 and it occasionally fails. The failures seem random. I see that we are hitting an EEH in slots behind a PLX switch, but we hit the EEH in successful attempts as well, so I am not sure how much it is related.

I collected the console log of the failed attempt (I am attaching it). I am attempting to drop into a shell by setting sh or bash to run in the KDUMP_FAIL_CMD option, but it seems to just hang and not give me a console. I want to collect more logs and check whether the adapter is able to reach the peer, to see if it is possibly a timing issue. Is there a proper way to drop to a shell in case of a kdump failure?

Also, I am attempting to reinstall Ubuntu 18.04 to retry kdump a few more times and make sure I didn't just get lucky, but I am hitting IBM Bug 165336 - Canonical LP 1753449, which is preventing me from reinstalling Ubuntu 18.04.
[Bug 1681909] Comment bridged from LTC Bugzilla
--- Comment From mu...@br.ibm.com 2018-03-05 08:49 EDT---
(In reply to comment #40)
> Can you try 16.04?
>
> Thanks.
> Cascardo.

Cascardo, sure, I'll give it a try and report back the test results as soon as possible.

Best regards,
Murilo

-- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1681909
Title: Ubuntu 17.04: dump is not captured in remote host when kdump over ssh is configured on firestone.
To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu-power-systems/+bug/1681909/+subscriptions
-- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1681909] Comment bridged from LTC Bugzilla
--- Comment From mu...@br.ibm.com 2018-02-26 13:57 EDT---
I tried reproducing this issue in a Firestone system our team owns with Ubuntu 18.04 and I couldn't reproduce the issue.

root@ltc-fire1:~# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu Bionic Beaver (development branch)"
root@ltc-fire1:~# uname -r
4.15.0-10-generic
root@ltc-fire1:~# ethtool -i enP1p1s0f0 | grep "driver\|firmware\|bus-info"
driver: tg3
firmware-version: 5719-v1.38i
bus-info: 0001:01:00.0
root@ltc-fire1:~# lspci -vmmnn -s 0001:01:00.0
Slot: 0001:01:00.0
Class: Ethernet controller [0200]
Vendor: Broadcom Limited [14e4]
Device: NetXtreme BCM5719 Gigabit Ethernet PCIe [1657]
SVendor:IBM [1014]
SDevice:NetXtreme BCM5719 Gigabit Ethernet PCIe (FC 5260/5899 4-port 1 GbE Adapter for Power) [0420]
Rev:01
NUMANode: 0

Here is a snippet of the kdump attempt to reproduce the issue:

[ 129.602468] kdump-tools[1599]: Starting kdump-tools: * sending makedumpfile -c -d 31 -F /proc/vmcore to root@9.40.194.212 : /var/crash/9.40.195.135-201802261141/dump-incomplete
[ 129.719688] kdump-tools[1599]: The kernel version is not supported.
[ 129.720035] kdump-tools[1599]: The makedumpfile operation may be incomplete.
Copying data : [100.0 %] \ eta: 0s
[ 144.173303] kdump-tools[1599]: The dumpfile is saved to STDOUT.
[ 144.173531] kdump-tools[1599]: makedumpfile Completed.
[ 144.184688] kdump-tools[1599]: 533781+259 records in
[ 144.184975] kdump-tools[1599]: 533918+1 records out
[ 144.185223] kdump-tools[1599]: 273366297 bytes (273 MB, 261 MiB) copied, 14.3426 s, 19.1 MB/s
[ 144.419378] kdump-tools[1599]: * kdump-tools: saved vmcore in root@9.40.194.212:/var/crash/9.40.195.135-201802261141
[ 144.439183] kdump-tools[1599]: * running makedumpfile --dump-dmesg /proc/vmcore /tmp/dmesg.201802261141
[ 144.445902] kdump-tools[1599]: The kernel version is not supported.
[ 144.446266] kdump-tools[1599]: The makedumpfile operation may be incomplete.
[ 144.446557] kdump-tools[1599]: The dmesg log is saved to /tmp/dmesg.201802261141.
[ 144.446844] kdump-tools[1599]: makedumpfile Completed.
[ 144.717033] kdump-tools[1599]: * kdump-tools: saved dmesg content in root@9.40.194.212:/var/crash/9.40.195.135-201802261141
[ 144.718981] kdump-tools[1599]: Mon, 26 Feb 2018 11:42:11 -0700
[ 144.825931] kdump-tools[1599]: Rebooting.
[ 144.950195] reboot: Restarting system

Both the dmesg and the dump file were transferred to the peer under /var/crash/:

root@ltc-zz4-lp2:~# ls /var/crash/9.40.195.135-201802261141
dmesg.201802261141 dump.201802261141

Pavithra, if you still have the system, could you attempt to reproduce this issue in your environment?

--- Comment From pavra...@in.ibm.com 2018-02-28 01:46 EDT---
The issue is not observed on the same machine with 17.10 and 18.04. We can close the bug.

18.04

Starting Kernel crash dump capture service...
[ 29.664816] kdump-tools[1255]: Starting kdump-tools: * sending makedumpfile -c -d 31 -F /proc/vmcore to root@9.40.192.198 : /var/crash/9.47.70.29-201802280145/dump-incomplete
[ 29.732510] kdump-tools[1255]: The kernel version is not supported.
[ 29.732618] kdump-tools[1255]: The makedumpfile operation may be incomplete.
Copying data : [100.0 %] / eta: 0s
[ 41.966840] kdump-tools[1255]: The dumpfile is saved to STDOUT.
[ 41.966939] kdump-tools[1255]: makedumpfile Completed.
[ 42.112371] kdump-tools[1255]: 247581+447 records in
[ 42.112456] kdump-tools[1255]: 247802+1 records out
[ 42.112557] kdump-tools[1255]: 126874775 bytes (127 MB) copied, 11.3072 s, 11.2 MB/s
[ 43.275938] kdump-tools[1255]: * kdump-tools: saved vmcore in root@9.40.192.198:/var/crash/9.47.70.29-201802280145
[ 43.295066] kdump-tools[1255]: * running makedumpfile --dump-dmesg /proc/vmcore /tmp/dmesg.201802280145
[ 43.302493] kdump-tools[1255]: The kernel version is not supported.
[ 43.302611] kdump-tools[1255]: The makedumpfile operation may be incomplete.
[ 43.302694] kdump-tools[1255]: The dmesg log is saved to /tmp/dmesg.201802280145.
[ 43.302804] kdump-tools[1255]: makedumpfile Completed.
[ 44.711325] kdump-tools[1255]: * kdump-tools: saved dmesg content in root@9.40.192.198:/var/crash/9.47.70.29-201802280145
[ 44.713050] kdump-tools[1255]: Wed, 28 Feb 2018 01:45:29 -0500
[ 44.840303] kdump-tools[1255]: Rebooting.

17.10 ==

[root@lep8a crash]# ls
9.47.70.29-201802280052
[root@lep8a crash]# cd 9.47.70.29-201802280052
[root@lep8a 9.47.70.29-201802280052]# ls
dmesg.201802280052 dump.201802280052
[root@lep8a 9.47.70.29-201802280052]# tail dmesg.201802280052
[ 1133.042160] Instruction dump:
[ 1133.042192] 4bfff9f1 4bfffe50 3c4c00e6 384228e0 7c0802a6 6000 3921 3d42001d
[ 1133.042410] 394adb30 912a 7c0004ac 3940 <992a> 4e800020 3c4c00e6 384228b0
[ 11
[Bug 1681909] Comment bridged from LTC Bugzilla
--- Comment From hbath...@in.ibm.com 2017-11-17 03:21 EDT---
(In reply to comment #32)
> Hi, I am a little eager to add this without trying to resort to other
> solutions first.
>
> So, options are:
>
> 1) For some reason, this driver is not behaving correctly. Can you add the
> PowerIO folks to this bug on IBM side and let them do some investigation?

done.

>
> 2) network-online is not doing the correct thing. Well, from what I read,
> they indeed don't care much about this and think the program should wait for
> the network to be available. After eliminating 1, we should look into why
> network-online decides the network is online or why systemd would start
> kdump after that, and the ssh host would still not be reachable.
>
> 3) I would rather add the timeout but also conditionally checking for the
> host availability. That is: wait until it's available, then dump. If not
> available for the timeout duration, reboot.

Sounds reasonable.

Thanks
Hari
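Option 3 above (wait until the ssh host is available, then dump; reboot if it stays unavailable for the timeout duration) could look roughly like the sketch below. This is only an illustration, not kdump-tools code: the function name wait_for_host is hypothetical, and the probe command is passed in by the caller so that the reachability check itself stays an open choice.

```shell
#!/bin/sh
# Hypothetical helper (not part of kdump-tools): retry a probe command once
# per second until it succeeds or TIMEOUT seconds of waiting have passed.
# Usage: wait_for_host TIMEOUT PROBE_COMMAND [ARGS...]
wait_for_host() {
    timeout=$1
    shift
    elapsed=0
    # "$@" is the probe; its output is discarded, only the exit status counts.
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1    # host never became reachable within the timeout
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0            # probe succeeded, host is reachable
}
```

A caller might probe with something like `wait_for_host 30 ssh -o BatchMode=yes -o ConnectTimeout=5 root@PEER true`, then send the dump on success and reboot on failure. PEER and the 30 second value are placeholders, not values taken from this bug.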
[Bug 1681909] Comment bridged from LTC Bugzilla
--- Comment From hbath...@in.ibm.com 2017-07-31 14:14 EDT---
(In reply to comment #30)
> I don't see any fundamental issue with providing a NET_WAIT_TIME variable
> (probably should be namespaced to KDUMP_) in the kdump config file, but:

Right, KDUMP_NET_WAIT_TIME is better.

>
> 1) this seems like a hack to work around slow hardware, right?
>

Yeah, for NICs that are slow to initialize. With my limited expertise in network-related problems, I thought this could be a nice config option to have. The right fix might be somewhere in the network-related code.

> 2) it can't be automatically deduced, afaict. Or do you want to have 30s
> delays (potentially) on all POWER machines?
>

No. I think this has more to do with the NIC than the arch. What I have in mind is a 0s delay by default, but one that can be set to a non-zero value for NICs like this using KDUMP_NET_WAIT_TIME=

> 3) I'm not 100% familiar with the 'upstream' of kdump-tools -- is this
> something that we'd need to carry forever in the Debian/Ubuntu packaging?

Probably, unless there is a fix in the NIC (hardware/firmware) and/or the network-related code that makes this config option redundant.

Thanks
Hari
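To make the proposed default concrete, here is a minimal sketch of how a script could read such a setting. KDUMP_NET_WAIT_TIME is only the name proposed in this thread, not an existing kdump-tools option, and net_wait_seconds is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical sketch: KDUMP_NET_WAIT_TIME is the variable proposed in this
# bug, not a setting that kdump-tools is known to support.
# Prints the delay in seconds: 0 when the variable is unset, empty, or not a
# plain non-negative integer; otherwise the configured value.
net_wait_seconds() {
    wait_time=${KDUMP_NET_WAIT_TIME:-0}
    case $wait_time in
        *[!0-9]*) wait_time=0 ;;   # reject anything that is not all digits
    esac
    echo "$wait_time"
}
```

A dump script could then run `sleep "$(net_wait_seconds)"` before contacting the ssh host, which costs nothing on machines that leave the variable unset.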
[Bug 1681909] Comment bridged from LTC Bugzilla
--- Comment From hbath...@in.ibm.com 2017-06-21 10:28 EDT---
Canonical, any take on introducing NET_WAIT_TIME in the /etc/default/kdump-tools file to deal with timing issues on some NICs?

Thanks
Hari