[Bug 1681909] Comment bridged from LTC Bugzilla

2018-03-06 Thread bugproxy
--- Comment From mu...@br.ibm.com 2018-03-06 11:15 EDT---
(In reply to comment #45)
> Hi, Murilo.
>
> Can you test it on 16.04 using kdump-tools from xenial-proposed? Maybe the
> noirqdistrib option might be related to the EEH issues.
>

Ok, I'll give it a try.

(In reply to comment #46)
> Looking at the log, I noticed the EEH is frozen right after finding the
> Broadcom card. Is that one the tg3?
>
> [  OK  ] Found device NetXtreme BCM5719 Gigabit Ethernet PCIe.
> [8.191135] EEH: Frozen PE#7 on PHB#21 detected
> [8.191280] EEH: PE location: S00210f, PHB location: N/A

Yes, correct, this is the tg3 device. But the EEH is seen on a different
PHB than the one the adapter is in. The adapter is on PHB#01, whereas
the EEH is seen on PHB#21.
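
A minimal sketch for cross-checking the PHB of an EEH event against the
PCI domain of an adapter. It assumes that the PHB number in the EEH
message corresponds to the PCI domain in the device address, as the ast
lines in this log suggest (PHB#21, pci addr=0021:10:00.0). The function
names are hypothetical and not part of any existing tool.

```shell
#!/bin/sh
# Print the PHB number of every "EEH: Frozen PE" line read from stdin.
# Other EEH lines (for example "PE location") are ignored.
eeh_frozen_phbs() {
    sed -n 's/.*EEH: Frozen PE#[0-9a-fA-F]* on PHB#\([0-9a-fA-F]*\) detected.*/\1/p'
}

# Print the PCI domain of a full PCI address, i.e. the part before the
# first colon (0001 for 0001:01:00.0).
pci_domain() {
    echo "${1%%:*}"
}
```

For example, `dmesg | eeh_frozen_phbs` could be compared by eye with
`pci_domain 0001:01:00.0` for the tg3 adapter.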

>
> Also, the recovery problem seems to be caused by ast.
>
> [   18.267005] EEH: 210 reads ignored for recovering device at
> location=S00210f driver=ast pci addr=0021:10:00.0
> [   18.267334] EEH: Might be infinite loop in ast driver
>
> Looking at the upstream logs, one commit came up. Can you open a new bug for
> it?
>
> commit 298360af3dab45659810fdc51aba0c9f4097e4f6
> Author: Russell Currey 
> Date:   Thu Dec 15 16:12:41 2016 +1100
>
> drivers/gpu/drm/ast: Fix infinite loop if read fails

Cascardo, the patch you mention is already in this kernel. The changelog
for linux-image-4.4.0-116-generic shows:
* Xenial update to v4.4.41 stable release (LP: #1655041)
- drivers/gpu/drm/ast: Fix infinite loop if read fails

Also, ast is not the only device hitting the EEH. When I blacklisted the
ast module, I still saw the EEH hitting the other slots behind the PLX
switch.

I was able to collect a full dmesg output by adding the dmesg command to
the KDUMP_FAIL_CMD option, but I still had no luck getting it to drop to
a shell.
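
For illustration, a setting along these lines in /etc/default/kdump-tools
would print the kernel log when saving the vmcore fails. The exact
command string used in the test is not given in this thread, so the
value below is an assumption. As noted above, the shell step did not
yield a console on this system.

```shell
# /etc/default/kdump-tools (fragment, illustrative value only)
# On failure to save the vmcore: print the kernel log to the console,
# try to start a shell, and reboot once the shell exits.
KDUMP_FAIL_CMD="dmesg; /bin/sh; reboot -f"
```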

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1681909

Title:
  dump is not captured in remote host when kdump over ssh is configured
  on firestone.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1681909/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1681909] Comment bridged from LTC Bugzilla

2018-03-06 Thread bugproxy
--- Comment From mu...@br.ibm.com 2018-03-06 09:38 EDT---
Cascardo, I tried kdump on Ubuntu 16.04 and it seems to fail occasionally.

The failures seem to be random. I see that we are hitting an EEH in
slots behind a PLX switch, but we hit the EEH in successful attempts as
well, so I am not sure how much it is related. I collected the console
log of a failed attempt (I am attaching it).

I am attempting to drop into a shell by setting sh or bash to run in the
KDUMP_FAIL_CMD option, but it seems to just hang and not give me a
console. I want to collect more logs and check whether the adapter is
able to reach the peer, to see if this is possibly a timing issue. Is
there a proper way to drop to a shell in case of a kdump failure?

Also, I am attempting to reinstall Ubuntu 18.04 to retry kdump a few
more times and make sure I didn't just get lucky, but I am hitting IBM
Bug 165336 - Canonical LP 1753449, which is preventing me from
reinstalling Ubuntu 18.04.

[Bug 1681909] Comment bridged from LTC Bugzilla

2018-03-05 Thread bugproxy
--- Comment From mu...@br.ibm.com 2018-03-05 08:49 EDT---
(In reply to comment #40)
> Can you try 16.04?
>
> Thanks.
> Cascardo.

Cascardo, sure, I'll give it a try and report back the test results as
soon as possible.

Best regards,
Murilo

Title:
  Ubuntu 17.04: dump is not captured in remote host when kdump over ssh
  is configured on firestone.

[Bug 1681909] Comment bridged from LTC Bugzilla

2018-03-05 Thread bugproxy
--- Comment From mu...@br.ibm.com 2018-02-26 13:57 EDT---
I tried reproducing this issue on a Firestone system our team owns,
running Ubuntu 18.04, and I couldn't reproduce it.

root@ltc-fire1:~# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu Bionic Beaver (development branch)"

root@ltc-fire1:~# uname -r
4.15.0-10-generic

root@ltc-fire1:~# ethtool -i enP1p1s0f0 | grep "driver\|firmware\|bus-info"
driver: tg3
firmware-version: 5719-v1.38i
bus-info: 0001:01:00.0

root@ltc-fire1:~# lspci -vmmnn -s 0001:01:00.0
Slot:   0001:01:00.0
Class:  Ethernet controller [0200]
Vendor: Broadcom Limited [14e4]
Device: NetXtreme BCM5719 Gigabit Ethernet PCIe [1657]
SVendor:        IBM [1014]
SDevice:        NetXtreme BCM5719 Gigabit Ethernet PCIe (FC 5260/5899 4-port 1 GbE Adapter for Power) [0420]
Rev:    01
NUMANode:   0

Here is a snippet of the kdump attempt to reproduce the issue:

[  129.602468] kdump-tools[1599]: Starting kdump-tools:  * sending makedumpfile -c -d 31 -F /proc/vmcore to root@9.40.194.212 : /var/crash/9.40.195.135-201802261141/dump-incomplete
[  129.719688] kdump-tools[1599]: The kernel version is not supported.
[  129.720035] kdump-tools[1599]: The makedumpfile operation may be incomplete.
Copying data  : [100.0 %] \   eta: 0s
[  144.173303] kdump-tools[1599]: The dumpfile is saved to STDOUT.
[  144.173531] kdump-tools[1599]: makedumpfile Completed.
[  144.184688] kdump-tools[1599]: 533781+259 records in
[  144.184975] kdump-tools[1599]: 533918+1 records out
[  144.185223] kdump-tools[1599]: 273366297 bytes (273 MB, 261 MiB) copied, 14.3426 s, 19.1 MB/s
[  144.419378] kdump-tools[1599]:  * kdump-tools: saved vmcore in root@9.40.194.212:/var/crash/9.40.195.135-201802261141
[  144.439183] kdump-tools[1599]:  * running makedumpfile --dump-dmesg /proc/vmcore /tmp/dmesg.201802261141
[  144.445902] kdump-tools[1599]: The kernel version is not supported.
[  144.446266] kdump-tools[1599]: The makedumpfile operation may be incomplete.
[  144.446557] kdump-tools[1599]: The dmesg log is saved to /tmp/dmesg.201802261141.
[  144.446844] kdump-tools[1599]: makedumpfile Completed.
[  144.717033] kdump-tools[1599]:  * kdump-tools: saved dmesg content in root@9.40.194.212:/var/crash/9.40.195.135-201802261141
[  144.718981] kdump-tools[1599]: Mon, 26 Feb 2018 11:42:11 -0700
[  144.825931] kdump-tools[1599]: Rebooting.
[  144.950195] reboot: Restarting system

Both the dmesg and the dump file were transferred to the peer under /var/crash/:

root@ltc-zz4-lp2:~# ls /var/crash/9.40.195.135-201802261141
dmesg.201802261141  dump.201802261141

Pavithra, if you still have the system, could you attempt to reproduce
this issue in your environment?

--- Comment From pavra...@in.ibm.com 2018-02-28 01:46 EDT---
The issue is not observed on the same machine with 17.10 and 18.04.

We can close the bug.

18.04
==

Starting Kernel crash dump capture service...
[   29.664816] kdump-tools[1255]: Starting kdump-tools:  * sending makedumpfile -c -d 31 -F /proc/vmcore to root@9.40.192.198 : /var/crash/9.47.70.29-201802280145/dump-incomplete
[   29.732510] kdump-tools[1255]: The kernel version is not supported.
[   29.732618] kdump-tools[1255]: The makedumpfile operation may be incomplete.
Copying data  : [100.0 %] /   eta: 0s
[   41.966840] kdump-tools[1255]: The dumpfile is saved to STDOUT.
[   41.966939] kdump-tools[1255]: makedumpfile Completed.
[   42.112371] kdump-tools[1255]: 247581+447 records in
[   42.112456] kdump-tools[1255]: 247802+1 records out
[   42.112557] kdump-tools[1255]: 126874775 bytes (127 MB) copied, 11.3072 s, 11.2 MB/s
[   43.275938] kdump-tools[1255]:  * kdump-tools: saved vmcore in root@9.40.192.198:/var/crash/9.47.70.29-201802280145
[   43.295066] kdump-tools[1255]:  * running makedumpfile --dump-dmesg /proc/vmcore /tmp/dmesg.201802280145
[   43.302493] kdump-tools[1255]: The kernel version is not supported.
[   43.302611] kdump-tools[1255]: The makedumpfile operation may be incomplete.
[   43.302694] kdump-tools[1255]: The dmesg log is saved to /tmp/dmesg.201802280145.
[   43.302804] kdump-tools[1255]: makedumpfile Completed.
[   44.711325] kdump-tools[1255]:  * kdump-tools: saved dmesg content in root@9.40.192.198:/var/crash/9.47.70.29-201802280145
[   44.713050] kdump-tools[1255]: Wed, 28 Feb 2018 01:45:29 -0500
[   44.840303] kdump-tools[1255]: Rebooting.

17.10
==

[root@lep8a crash]# ls
9.47.70.29-201802280052
[root@lep8a crash]# cd 9.47.70.29-201802280052
[root@lep8a 9.47.70.29-201802280052]# ls
dmesg.201802280052  dump.201802280052
[root@lep8a 9.47.70.29-201802280052]# tail dmesg.201802280052
[ 1133.042160] Instruction dump:
[ 1133.042192] 4bfff9f1 4bfffe50 3c4c00e6 384228e0 7c0802a6 6000 3921 3d42001d
[ 1133.042410] 394adb30 912a 7c0004ac 3940 <992a> 4e800020 3c4c00e6 384228b0
[ 11

[Bug 1681909] Comment bridged from LTC Bugzilla

2017-11-17 Thread bugproxy
--- Comment From hbath...@in.ibm.com 2017-11-17 03:21 EDT---
(In reply to comment #32)
> Hi, I am a little eager to add this without trying to resort to other
> solutions first.
>
> So, options are:
>
> 1) For some reason, this driver is not behaving correctly. Can you add the
> PowerIO folks to this bug on IBM side and let them do some investigation?

done.

>
> 2) network-online is not doing the correct thing. Well, from what I read,
> they indeed don't care much about this and think the program should wait for
> the network to be available. After eliminating 1, we should look into why
> network-online decides the network is online or why systemd would start
> kdump after that, and the ssh host would still not be reachable.
>

> 3) I would rather add the timeout but also conditionally checking for the
> host availability. That is: wait until it's available, then dump. If not
> available for the timeout duration, reboot.

Sounds reasonable.
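
Option 3 above could be sketched roughly as follows: poll for the remote
host until it answers or a timeout expires, then either dump or reboot.
This is only an illustration under stated assumptions, not the
kdump-tools implementation. The function name wait_until and the
variable REMOTE_HOST are hypothetical, and KDUMP_NET_WAIT_TIME is the
option proposed in this bug, not an existing setting.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-run COMMAND about once per second until it succeeds or roughly
# TIMEOUT_SECONDS have elapsed (time spent inside COMMAND itself is not
# counted). Returns 0 on success, 1 on timeout.
wait_until() {
    timeout=$1
    shift
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Hypothetical use in the dump path (commented out, not executed here):
#   if wait_until "${KDUMP_NET_WAIT_TIME:-0}" ping -c 1 -W 1 "$REMOTE_HOST"; then
#       : # send the dump over ssh
#   else
#       reboot -f
#   fi
```

Taking the probe as a command argument keeps the waiting logic
independent of how reachability is checked (ping, an ssh connection
attempt, or something else).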

Thanks
Hari

[Bug 1681909] Comment bridged from LTC Bugzilla

2017-07-31 Thread bugproxy
--- Comment From hbath...@in.ibm.com 2017-07-31 14:14 EDT---
(In reply to comment #30)
> I don't see any fundamental issue with providing a NET_WAIT_TIME variable
> (probably should be namespaced to KDUMP_) in the kdump config file, but:

Right, KDUMP_NET_WAIT_TIME is better.

>
> 1) this seems like a hack to work around slow hardware, right?
>

Yes, for NICs that are slow to initialize. With my limited expertise
in network-related problems, I thought this would be a nice config option
to have. The right fix may lie somewhere in the network-related code.

> 2) it can't be automatically deduced, afaict. Or do you want to have 30s
> delays (potentially) on all POWER machines?
>

No. I think this has more to do with the NIC than with the architecture.
What I have in mind is a 0s delay by default, which can be set
to a non-zero value for NICs like this one using KDUMP_NET_WAIT_TIME=
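
A minimal sketch of how such an option might be honored, assuming the
variable is read from /etc/default/kdump-tools before the dump is sent.
KDUMP_NET_WAIT_TIME is the name proposed in this thread, and the two
function names are hypothetical.

```shell
#!/bin/sh
# Print the delay to apply, in seconds: the configured value, or 0 when
# KDUMP_NET_WAIT_TIME is unset or empty.
kdump_net_wait_seconds() {
    echo "${KDUMP_NET_WAIT_TIME:-0}"
}

# Sleep for the configured delay. With the default of 0 this returns
# immediately, so machines with fast NICs are not slowed down.
kdump_net_wait() {
    delay=$(kdump_net_wait_seconds)
    if [ "$delay" -gt 0 ]; then
        sleep "$delay"
    fi
}
```

With this shape, only systems that set a non-zero value pay the extra
delay.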

> 3) I'm not 100% familiar with the 'upstream' of kdump-tools -- is this
> something that we'd need to carry forever in the Debian/Ubuntu packaging?

Probably, unless there is a fix in the NIC (hardware/firmware) and/or the
network-related code that makes this config option redundant.

Thanks
Hari


[Bug 1681909] Comment bridged from LTC Bugzilla

2017-06-21 Thread bugproxy
--- Comment From hbath...@in.ibm.com 2017-06-21 10:28 EDT---
Canonical, any take on introducing NET_WAIT_TIME in the
/etc/default/kdump-tools file to deal with timing issues on some NICs?

Thanks
Hari
