This is the debdiff with the retry/delay mechanism, for Eoan. I've discussed it with Cascardo, and we agreed he will do the SRU to the older releases (X/B/C/D) after applying some other SRUs he's working on now.
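For readers who have not opened the debdiff, the general shape of a retry/delay mechanism can be sketched in POSIX shell as below. This is only an illustration under stated assumptions, not the code in the attached debdiff: the function name retry_with_delay and its arguments are hypothetical, and the count of 5 attempts in the usage comment is an arbitrary example (only the 3 second delay comes from this bug's description).

```shell
#!/bin/sh
# Illustrative sketch only: retry_with_delay and its parameters are
# hypothetical names, not the ones used in the actual debdiff.

# retry_with_delay RETRIES DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds or RETRIES attempts have been made,
# sleeping DELAY seconds between attempts.
# Returns 0 on success, otherwise the exit status of the last attempt.
retry_with_delay() {
    retries=$1
    delay=$2
    shift 2

    attempt=1
    while :; do
        # On failure of an AND list, $? holds the failing command's status.
        "$@" && return 0
        status=$?
        if [ "$attempt" -ge "$retries" ]; then
            return "$status"
        fi
        echo "attempt $attempt/$retries failed (status $status), retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Possible use for an SSH dump, reusing the variable names from the
# kdump-config snippet quoted in this bug (5 attempts is an assumption):
#   retry_with_delay 5 3 ssh -i "$KDUMP_SSH_KEY" "$KDUMP_REMOTE_HOST" mkdir -p "$KDUMP_STAMPDIR"
```

Compared with a fixed "sleep 30" before the first ssh call, a loop like this costs nothing when the NIC is already usable and only waits when an attempt actually fails.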
I'd especially like to thank Hari, Murilo and Pavithra from IBM, who reported, worked on and proposed a solution for this issue!

** Description changed:

- == Comment: #0 - PAVITHRA R. PRAKASH <[email protected]> - 2017-03-07 05:00:29 ==
- ---Problem Description---
+ [Impact]

- Ubuntu 17.04: dump is not captured in remote host when kdump over ssh is
- configured on firestone.
+ * Kdump over network (like NFS mount or SSH dump) relies on
+ network-online target from systemd. Even so, there are some NICs that report
+ "Link Up" state but aren't ready to transmit packets. This is a
+ generally bad behavior that is credited probably to NIC firmware delays,
+ usually not fixable from drivers. Some adapters known to act like this
+ are bnx2x, tg3 and ixgbe.

- ---Steps to Reproduce---
+ * Kdump is a mechanism that may be a last resort to debug complex/hard
+ to reproduce issues, so it's interesting to increase its reliability /
+ resilience. We then propose here a solution/quirk to this issue on
+ network dump by adding a retry/delay mechanism; if it's a network dump,
+ kdump will retry some times and sleep between the attempts in order to
+ exclude the case of NICs that aren't ready yet but will soon be able to
+ transmit packets.

- 1. Configure kdump.
- 2. Check whether kdump is operational using "# kdump-config show".
- 3. Install "kernel-debuginfo" and "kernel-debuginfo-common" rpms.
- 4. Setup password less ssh connection, generate rsa key.
-   # ssh-keygen -t rsa
- 5. verify id_rsa and id_rsa.pub are created under /root/.ssh/
- 6. Edit /etc/default/kdump-tools and add below entries.
-   SSH="[email protected]"
-   SSH_KEY=/root/.ssh/id_rsa
- 7. Propagate RSA key.
-   # kdump-config propagate
- 8. Restart kdump service.
-   # kdump-config load
- 9. Trigger Crash using below commands.
-   # echo "1" > /proc/sys/kernel/sysrq
-   # echo "c" > /proc/sysrq-trigger
- 10. Verify dump is available in remote server in configured path.
+ * Although first reported by IBM in PowerPC arch, the scope for this
+ issue is the NIC, and it was later reported in x86 arch too.

- Machine details
- ===========
+ [Test case]

- $ ipmitool -I lanplus -H 9.47.70.3 -U ADMIN -P admin sol activate
+ Usually it's difficult to naturally reproduce this issue in a deterministic way, but we have an artificial test case on comment #24 of this LP.
+ Also, we have a report from this bug in which the user managed to reproduce the problem consistently - it's fixed after testing our solution.

- $ ssh [email protected]
+ [Regression potential]

- PW: shriya101
-
-
- Attaching logs
-
- == Comment: #1 - PAVITHRA R. PRAKASH <[email protected]> - 2017-03-07 05:01:42 ==
-
-
- == Comment: #5 - PAVITHRA R. PRAKASH <[email protected]> - 2017-03-07 23:19:46 ==
- Hi,
-
- Attaching the logs.
-
- Network info:
-
- root@ltc-firep3:~# hwinfo --network
- 36: None 00.0: 10700 Loopback
-   [Created at net.126]
-   Unique ID: ZsBS.GQNx7L4uPNA
-   SysFS ID: /class/net/lo
-   Hardware Class: network interface
-   Model: "Loopback network interface"
-   Device File: lo
-   Link detected: yes
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-
- 37: None 00.0: 10701 Ethernet
-   [Created at net.126]
-   Unique ID: 2lHw.ndpeucax6V1
-   Parent ID: mIXc.aXC4wIvegH8
-   SysFS ID: /class/net/enP33p3s0f2
-   SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.2
-   Hardware Class: network interface
-   Model: "Ethernet network interface"
-   Driver: "tg3"
-   Driver Modules: "tg3"
-   Device File: enP33p3s0f2
-   HW Address: 98:be:94:03:18:4a
-   Permanent HW Address: 98:be:94:03:18:4a
-   Link detected: no
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-   Attached to: #15 (Ethernet controller)
-
- 38: None 00.0: 10701 Ethernet
-   [Created at net.126]
-   Unique ID: 7Onn.ndpeucax6V1
-   Parent ID: sx0U.aXC4wIvegH8
-   SysFS ID: /class/net/enP33p3s0f0
-   SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.0
-   Hardware Class: network interface
-   Model: "Ethernet network interface"
-   Driver: "tg3"
-   Driver Modules: "tg3"
-   Device File: enP33p3s0f0
-   HW Address: 98:be:94:03:18:48
-   Permanent HW Address: 98:be:94:03:18:48
-   Link detected: yes
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-   Attached to: #16 (Ethernet controller)
-
- 39: None 00.0: 10701 Ethernet
-   [Created at net.126]
-   Unique ID: VwX_.ndpeucax6V1
-   Parent ID: DUng.aXC4wIvegH8
-   SysFS ID: /class/net/enP33p3s0f3
-   SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.3
-   Hardware Class: network interface
-   Model: "Ethernet network interface"
-   Driver: "tg3"
-   Driver Modules: "tg3"
-   Device File: enP33p3s0f3
-   HW Address: 98:be:94:03:18:4b
-   Permanent HW Address: 98:be:94:03:18:4b
-   Link detected: no
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-   Attached to: #25 (Ethernet controller)
-
- 40: None 00.0: 10701 Ethernet
-   [Created at net.126]
-   Unique ID: bZ1s.ndpeucax6V1
-   Parent ID: J7HY.aXC4wIvegH8
-   SysFS ID: /class/net/enP33p3s0f1
-   SysFS Device Link: /devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.1
-   Hardware Class: network interface
-   Model: "Ethernet network interface"
-   Driver: "tg3"
-   Driver Modules: "tg3"
-   Device File: enP33p3s0f1
-   HW Address: 98:be:94:03:18:49
-   Permanent HW Address: 98:be:94:03:18:49
-   Link detected: no
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-   Attached to: #4 (Ethernet controller)
- root@ltc-firep3:~#
-
-
- Thanks,
- Pavithra
-
- == Comment: #6 - PAVITHRA R. PRAKASH <[email protected]> - 2017-03-07 23:20:47 ==
-
-
- == Comment: #7 - PAVITHRA R. PRAKASH <[email protected]> - 2017-03-07 23:21:27 ==
-
-
- == Comment: #8 - Urvashi Jawere <[email protected]> - 2017-03-08 02:48:15 ==
- I am able to see some errors in syslog ;
-
- auxiliary
- Mar 7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 114.15.239:/home/ubuntu/test IN SOA: failed-auxiliary
- Mar 7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 9.114.15.239:/home/ubuntu/test IN DS: failed-auxiliary
- Mar 7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 9.114.15.239:/home/ubuntu/test IN SOA: failed-auxiliary
- Mar 7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed for question 9.114.15.239:/home/ubuntu/test IN A: failed-auxiliary
- Mar 7 04:57:44 ltc-firep3 systemd-resolved[3486]: Server 9.12.16.2 does not support DNSSEC, downgrading to non-DNSSEC mode.
- Mar 7 04:57:44 ltc-firep3 kdump-config: /root/.ssh/id_rsa failed to be sent to [email protected]:/home/ubuntu/test
- Mar 7 04:58:04 ltc-firep3 systemd[1]: Reloading.
- Mar 7 04:59:15 ltc-firep3 systemd[1]: Reloading.
- Mar 7 04:59:16 ltc-firep3 kdump-config: propagated ssh key /root/.ssh/id_rsa to server [email protected]
- .
- .
- .
-
- Mar 7 05:06:55 ltc-firep3 systemd[1]: Started Accounts Service.
- Mar 7 05:06:56 ltc-firep3 kdump-tools[3498]: Starting kdump-tools: Modified cmdline:root=UUID=1e76cfd5-988c-46f4-bdc4-39fe1ed01152 ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0 elfcorehdr=155136K
- Mar 7 05:06:57 ltc-firep3 kdump-tools[3498]: * loaded kdump kernel
- Mar 7 05:06:57 ltc-firep3 kdump-tools: /sbin/kexec -p --command-line="root=UUID=1e76cfd5-988c-46f4-bdc4-39fe1ed01152 ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz
- Mar 7 05:06:57 ltc-firep3 kdump-tools: loaded kdump kernel
- Mar 7 05:06:57 ltc-firep3 systemd[1]: Started Kernel crash dump capture service.
- Mar 7 05:06:57 ltc-firep3 apport[3584]: ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/linux-image-4.10.0-9-generic-201703060521.crash'
- Mar 7 05:06:57 ltc-firep3 apport[3584]: ...done.
-
- == Comment: #18 - Hari Krishna Bathini <[email protected]> - 2017-03-28 06:55:20 ==
- Looks like tg3 module was not needed after all. Interesting thing though is
- even after enP34p1s0f0 is up (ifup) and network.online target is reached,
- network was not really active. It took about 30 seconds, after reaching
- network.online target, for the network to be active, even on a normal boot.
- Adding this wait time in kdump script, before saving dump, ensured that
- vmcore is captured successful. Attaching the log for the same..
-
- Not sure why enP34p1s0f0 is taking that long to configure/initialize. Even so,
- this delay should be part of ifup/network-online.target if it is inevitable,
- so that network is pingable after network-online.target
-
- Thanks
- Hari
-
- == Comment: #19 - Hari Krishna Bathini <[email protected]> - 2017-03-28 07:01:52 ==
- The workaround snippet adding delay in kdump script:
-
-
- --- kdump-config.orig 2017-03-28 03:35:17.753542107 -0500
- +++ kdump-config 2017-03-28 06:59:22.887576623 -0500
- @@ -761,6 +761,7 @@
-         KDUMP_DMESGFILE="$KDUMP_STAMPDIR/dmesg.$KDUMP_STAMP"
-         ERROR=0
-
- +       sleep 30
-         ssh -i $KDUMP_SSH_KEY $KDUMP_REMOTE_HOST mkdir -p $KDUMP_STAMPDIR
-         ERROR=$?
-         # If remote connections fails, no need to continue
-
- ---
-
- Thanks
- Hari
-
- == Comment: #20 - PAVITHRA R. PRAKASH <[email protected]> - 2017-03-30 01:33:56 ==
- (In reply to comment #19)
- > The workaround snippet adding delay in kdump script:
- >
- >
- > --- kdump-config.orig 2017-03-28 03:35:17.753542107 -0500
- > +++ kdump-config 2017-03-28 06:59:22.887576623 -0500
- > @@ -761,6 +761,7 @@
- >         KDUMP_DMESGFILE="$KDUMP_STAMPDIR/dmesg.$KDUMP_STAMP"
- >         ERROR=0
- >
- > +       sleep 30
- >         ssh -i $KDUMP_SSH_KEY $KDUMP_REMOTE_HOST mkdir -p $KDUMP_STAMPDIR
- >         ERROR=$?
- >         # If remote connections fails, no need to continue
- >
- > ---
- >
- > Thanks
- > Hari
-
- With above workaround dump captured successfully in remote host.
-
- Thanks,
- Pavithra
-
- == Comment: #22 - Hari Krishna Bathini <[email protected]> - 2017-04-10 22:14:27 ==
- (In reply to comment #18)
- > Created attachment 117088 [details]
- > Console log of successful dump capture after adding a time delay of 'sleep
- > 30'
- >
- > Looks like tg3 module was not needed after all. Interesting thing though is
- > even after enP34p1s0f0 is up (ifup) and network.online target is reached,
- > network was not really active. It took about 30 seconds, after reaching
- > network.online target, for the network to be active, even on a normal boot.
- > Adding this wait time in kdump script, before saving dump, ensured that
- > vmcore is captured successful. Attaching the log for the same..
- >
- > Not sure why enP34p1s0f0 is taking that long to configure/initialize. Even
- > so,
- > this delay should be part of ifup/network-online.target if it is inevitable,
- > so that network is pingable after network-online.target
-
- Hi Canonical,
-
- Since this falls outside the realm of kdump, should we add a NET_WAIT_TIME field
- in /etc/default/kdump-tools file that defaults to 0 but can be changed when the
- user sees timing troubles?
-
- Thanks
- Hari
+ There's not a clear regression potential here since it's just a retry/delay mechanism. Some potential problems may come from bad coding in the script.
+ The delay between attempts is only 3 sec per iteration, so it shouldn't block the kdump progress for a high amount of time at once.

** Patch added: "lp1681909_eoan.debdiff"
   https://bugs.launchpad.net/ubuntu-power-systems/+bug/1681909/+attachment/5275117/+files/lp1681909_eoan.debdiff

** Changed in: makedumpfile (Ubuntu Xenial)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Bionic)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Cosmic)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Disco)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Eoan)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Disco)
     Assignee: Guilherme G. Piccoli (gpiccoli) => Thadeu Lima de Souza Cascardo (cascardo)

** Changed in: makedumpfile (Ubuntu Cosmic)
     Assignee: Guilherme G. Piccoli (gpiccoli) => Thadeu Lima de Souza Cascardo (cascardo)

** Changed in: makedumpfile (Ubuntu Bionic)
     Assignee: Guilherme G. Piccoli (gpiccoli) => Thadeu Lima de Souza Cascardo (cascardo)

** Changed in: makedumpfile (Ubuntu Xenial)
     Assignee: Guilherme G. Piccoli (gpiccoli) => Thadeu Lima de Souza Cascardo (cascardo)

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1681909

Title:
  kdump is not captured in remote host when kdump over ssh is configured

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1681909/+subscriptions

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
