This is the debdiff with the retry/delay mechanism, for Eoan. I've
discussed it with Cascardo and we agreed he will do the SRU to the older
releases (X/B/C/D) after applying some other SRUs he's working on now.

I'd especially like to thank Hari, Murilo and Pavithra from IBM, who
reported, worked on and proposed a solution for this issue!

** Description changed:

- == Comment: #0 - PAVITHRA R. PRAKASH <[email protected]> - 2017-03-07 
05:00:29 ==
- ---Problem Description---
+ [Impact]
  
- Ubuntu 17.04: dump is not captured in remote host when kdump over ssh is
- configured on firestone.
+ * Kdump over network (e.g. NFS mount or SSH dump) relies on the
+ network-online target from systemd. Even so, some NICs report the
+ "Link Up" state before they are ready to transmit packets. This is
+ generally bad behavior, probably caused by NIC firmware delays, and
+ usually not fixable from the drivers. Some adapters known to act like
+ this are bnx2x, tg3 and ixgbe.
  
- ---Steps to Reproduce---
+ * Kdump may be the last resort for debugging complex/hard-to-reproduce
+ issues, so it's worth increasing its reliability/resilience. We
+ therefore propose a workaround for this issue in network dumps by
+ adding a retry/delay mechanism: if it's a network dump, kdump will
+ retry a few times and sleep between the attempts, in order to cover
+ the case of NICs that aren't ready yet but will soon be able to
+ transmit packets.
  
- 1. Configure kdump.
- 2. Check whether kdump is operational using "# kdump-config show".
- 3. Install "kernel-debuginfo" and "kernel-debuginfo-common" rpms.
- 4. Setup password less ssh connection, generate rsa key.
- # ssh-keygen -t rsa
- 5. verify id_rsa and id_rsa.pub are created under /root/.ssh/
- 6. Edit /etc/default/kdump-tools and add below entries.
- SSH="[email protected]"
- SSH_KEY=/root/.ssh/id_rsa
- 7. Propagate RSA key.
- # kdump-config propagate
- 8. Restart kdump service.
- # kdump-config load
- 9. Trigger Crash using below commands.
- # echo "1" > /proc/sys/kernel/sysrq
- # echo "c" > /proc/sysrq-trigger
- 10. Verify dump is available in remote server in configured path.
+ * Although first reported by IBM on the PowerPC arch, the scope of this
+ issue is the NIC, and it was later reported on the x86 arch too.
  
- Machine details
- ===========
+ [Test case]
  
- $ ipmitool -I lanplus -H  9.47.70.3 -U ADMIN -P admin sol activate
+ It's usually difficult to reproduce this issue naturally in a
+ deterministic way, but we have an artificial test case in comment #24
+ of this LP.
+ Also, we have a report in this bug from a user who managed to
+ reproduce the problem consistently; the problem went away when testing
+ our solution.
  
- $ ssh [email protected]
+ [Regression potential]
  
- PW: shriya101
- 
- 
- Attaching logs
- 
- == Comment: #1 - PAVITHRA R. PRAKASH <[email protected]> - 2017-03-07
- 05:01:42 ==
- 
- 
- == Comment: #5 - PAVITHRA R. PRAKASH <[email protected]> - 2017-03-07 
23:19:46 ==
- Hi, 
- 
- Attaching the logs.
- 
- Network info:
- 
- root@ltc-firep3:~# hwinfo --network
- 36: None 00.0: 10700 Loopback                                   
-   [Created at net.126]
-   Unique ID: ZsBS.GQNx7L4uPNA
-   SysFS ID: /class/net/lo
-   Hardware Class: network interface
-   Model: "Loopback network interface"
-   Device File: lo
-   Link detected: yes
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
- 
- 37: None 00.0: 10701 Ethernet
-   [Created at net.126]
-   Unique ID: 2lHw.ndpeucax6V1
-   Parent ID: mIXc.aXC4wIvegH8
-   SysFS ID: /class/net/enP33p3s0f2
-   SysFS Device Link: 
/devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.2
-   Hardware Class: network interface
-   Model: "Ethernet network interface"
-   Driver: "tg3"
-   Driver Modules: "tg3"
-   Device File: enP33p3s0f2
-   HW Address: 98:be:94:03:18:4a
-   Permanent HW Address: 98:be:94:03:18:4a
-   Link detected: no
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-   Attached to: #15 (Ethernet controller)
- 
- 38: None 00.0: 10701 Ethernet
-   [Created at net.126]
-   Unique ID: 7Onn.ndpeucax6V1
-   Parent ID: sx0U.aXC4wIvegH8
-   SysFS ID: /class/net/enP33p3s0f0
-   SysFS Device Link: 
/devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.0
-   Hardware Class: network interface
-   Model: "Ethernet network interface"
-   Driver: "tg3"
-   Driver Modules: "tg3"
-   Device File: enP33p3s0f0
-   HW Address: 98:be:94:03:18:48
-   Permanent HW Address: 98:be:94:03:18:48
-   Link detected: yes
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-   Attached to: #16 (Ethernet controller)
- 
- 39: None 00.0: 10701 Ethernet
-   [Created at net.126]
-   Unique ID: VwX_.ndpeucax6V1
-   Parent ID: DUng.aXC4wIvegH8
-   SysFS ID: /class/net/enP33p3s0f3
-   SysFS Device Link: 
/devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.3
-   Hardware Class: network interface
-   Model: "Ethernet network interface"
-   Driver: "tg3"
-   Driver Modules: "tg3"
-   Device File: enP33p3s0f3
-   HW Address: 98:be:94:03:18:4b
-   Permanent HW Address: 98:be:94:03:18:4b
-   Link detected: no
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-   Attached to: #25 (Ethernet controller)
- 
- 40: None 00.0: 10701 Ethernet
-   [Created at net.126]
-   Unique ID: bZ1s.ndpeucax6V1
-   Parent ID: J7HY.aXC4wIvegH8
-   SysFS ID: /class/net/enP33p3s0f1
-   SysFS Device Link: 
/devices/pci0021:00/0021:00:00.0/0021:01:00.0/0021:02:01.0/0021:03:00.1
-   Hardware Class: network interface
-   Model: "Ethernet network interface"
-   Driver: "tg3"
-   Driver Modules: "tg3"
-   Device File: enP33p3s0f1
-   HW Address: 98:be:94:03:18:49
-   Permanent HW Address: 98:be:94:03:18:49
-   Link detected: no
-   Config Status: cfg=new, avail=yes, need=no, active=unknown
-   Attached to: #4 (Ethernet controller)
- root@ltc-firep3:~# 
- 
- 
- Thanks,
- Pavithra
- 
- == Comment: #6 - PAVITHRA R. PRAKASH <[email protected]> - 2017-03-07
- 23:20:47 ==
- 
- 
- == Comment: #7 - PAVITHRA R. PRAKASH <[email protected]> - 2017-03-07 
23:21:27 ==
- 
- 
- == Comment: #8 - Urvashi Jawere <[email protected]> - 2017-03-08 02:48:15 ==
- I am able to see some errors in syslog ;
- 
- auxiliary
- Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed 
for question 114.15.239:/home/ubuntu/test IN SOA: failed-auxiliary
- Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed 
for question 9.114.15.239:/home/ubuntu/test IN DS: failed-auxiliary
- Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed 
for question 9.114.15.239:/home/ubuntu/test IN SOA: failed-auxiliary
- Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: DNSSEC validation failed 
for question 9.114.15.239:/home/ubuntu/test IN A: failed-auxiliary
- Mar  7 04:57:44 ltc-firep3 systemd-resolved[3486]: Server 9.12.16.2 does not 
support DNSSEC, downgrading to non-DNSSEC mode.
- Mar  7 04:57:44 ltc-firep3 kdump-config: /root/.ssh/id_rsa failed to be sent 
to [email protected]:/home/ubuntu/test
- Mar  7 04:58:04 ltc-firep3 systemd[1]: Reloading.
- Mar  7 04:59:15 ltc-firep3 systemd[1]: Reloading.
- Mar  7 04:59:16 ltc-firep3 kdump-config: propagated ssh key /root/.ssh/id_rsa 
to server [email protected]
- .
- .
- .
- 
- Mar  7 05:06:55 ltc-firep3 systemd[1]: Started Accounts Service.
- Mar  7 05:06:56 ltc-firep3 kdump-tools[3498]: Starting kdump-tools: Modified 
cmdline:root=UUID=1e76cfd5-988c-46f4-bdc4-39fe1ed01152 ro quiet splash irqpoll 
nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0 
elfcorehdr=155136K
- Mar  7 05:06:57 ltc-firep3 kdump-tools[3498]:  * loaded kdump kernel
- Mar  7 05:06:57 ltc-firep3 kdump-tools: /sbin/kexec -p 
--command-line="root=UUID=1e76cfd5-988c-46f4-bdc4-39fe1ed01152 ro quiet splash 
irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service 
ata_piix.prefer_ms_hyperv=0" --initrd=/var/lib/kdump/initrd.img 
/var/lib/kdump/vmlinuz
- Mar  7 05:06:57 ltc-firep3 kdump-tools: loaded kdump kernel
- Mar  7 05:06:57 ltc-firep3 systemd[1]: Started Kernel crash dump capture 
service.
- Mar  7 05:06:57 ltc-firep3 apport[3584]: ERROR: Cannot create report: [Errno 
17] File exists: '/var/crash/linux-image-4.10.0-9-generic-201703060521.crash'
- Mar  7 05:06:57 ltc-firep3 apport[3584]:    ...done.
- 
- == Comment: #18 - Hari Krishna Bathini <[email protected]> - 2017-03-28 
06:55:20 ==
- Looks like tg3 module was not needed after all. Interesting thing though is
- even after enP34p1s0f0 is up (ifup) and network.online target is reached,
- network was not really active. It took about 30 seconds, after reaching 
- network.online target, for the network to be active, even on a normal boot.
- Adding this wait time in kdump script, before saving dump, ensured that
- vmcore is captured successful. Attaching the log for the same..
- 
- Not sure why enP34p1s0f0 is taking that long to configure/initialize. Even so,
- this delay should be part of ifup/network-online.target if it is inevitable,
- so that network is pingable after network-online.target
-  
- Thanks
- Hari
- 
- == Comment: #19 - Hari Krishna Bathini <[email protected]> - 2017-03-28 
07:01:52 ==
- The workaround snippet adding delay in kdump script:
- 
- 
- --- kdump-config.orig 2017-03-28 03:35:17.753542107 -0500
- +++ kdump-config      2017-03-28 06:59:22.887576623 -0500
- @@ -761,6 +761,7 @@
-       KDUMP_DMESGFILE="$KDUMP_STAMPDIR/dmesg.$KDUMP_STAMP"
-       ERROR=0
-  
- +     sleep 30
-       ssh -i $KDUMP_SSH_KEY $KDUMP_REMOTE_HOST mkdir -p $KDUMP_STAMPDIR
-       ERROR=$?
-       # If remote connections fails, no need to continue
- 
- ---
- 
- Thanks
- Hari
- 
- == Comment: #20 - PAVITHRA R. PRAKASH <[email protected]> - 2017-03-30 
01:33:56 ==
- (In reply to comment #19)
- > The workaround snippet adding delay in kdump script:
- > 
- > 
- > --- kdump-config.orig       2017-03-28 03:35:17.753542107 -0500
- > +++ kdump-config    2017-03-28 06:59:22.887576623 -0500
- > @@ -761,6 +761,7 @@
- >     KDUMP_DMESGFILE="$KDUMP_STAMPDIR/dmesg.$KDUMP_STAMP"
- >     ERROR=0
- >  
- > +   sleep 30
- >     ssh -i $KDUMP_SSH_KEY $KDUMP_REMOTE_HOST mkdir -p $KDUMP_STAMPDIR
- >     ERROR=$?
- >     # If remote connections fails, no need to continue
- > 
- > ---
- > 
- > Thanks
- > Hari
- 
- With above workaround dump captured successfully in remote host.
- 
- Thanks,
- Pavithra
- 
- == Comment: #22 - Hari Krishna Bathini <[email protected]> - 2017-04-10 
22:14:27 ==
- (In reply to comment #18)
- > Created attachment 117088 [details]
- > Console log of successful dump capture after adding a time delay of 'sleep
- > 30'
- > 
- > Looks like tg3 module was not needed after all. Interesting thing though is
- > even after enP34p1s0f0 is up (ifup) and network.online target is reached,
- > network was not really active. It took about 30 seconds, after reaching 
- > network.online target, for the network to be active, even on a normal boot.
- > Adding this wait time in kdump script, before saving dump, ensured that
- > vmcore is captured successful. Attaching the log for the same..
- > 
- > Not sure why enP34p1s0f0 is taking that long to configure/initialize. Even
- > so,
- > this delay should be part of ifup/network-online.target if it is inevitable,
- > so that network is pingable after network-online.target
- 
- Hi Canonical,
- 
- Since this falls outside the realm of kdump, should we add a NET_WAIT_TIME field
- in /etc/default/kdump-tools file that defaults to 0 but can be changed when the
- user sees timing troubles?
- 
- Thanks
- Hari
+ There's no clear regression potential here, since it's just a
+ retry/delay mechanism. Potential problems could come from coding
+ errors in the script.
+ The delay between attempts is only 3 seconds per iteration, so it
+ shouldn't block the kdump progress for a long time at once.
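
For illustration only, here is a minimal sketch of what a retry/delay
loop of this kind could look like in shell. It is not the code from the
debdiff: the function name retry_cmd and the variables RETRIES and DELAY
are hypothetical, the default of 5 attempts is an assumption, and only
the 3-second delay comes from the description above.

```shell
#!/bin/sh
# Hypothetical sketch of a retry/delay loop; names and defaults are
# illustrative and NOT taken from the actual debdiff.
RETRIES=${RETRIES:-5}   # assumed number of attempts
DELAY=${DELAY:-3}       # seconds between attempts (3 s per the description)

# retry_cmd CMD [ARGS...]
# Run CMD until it succeeds or RETRIES attempts have been made.
# Returns 0 on the first success, 1 if every attempt failed.
retry_cmd() {
    attempt=1
    while [ "$attempt" -le "$RETRIES" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt/$RETRIES failed" >&2
        # Sleep only if another attempt will follow.
        if [ "$attempt" -lt "$RETRIES" ]; then
            sleep "$DELAY"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

In the SSH dump path it could wrap the first remote command, for
example: retry_cmd ssh -i "$KDUMP_SSH_KEY" "$KDUMP_REMOTE_HOST" mkdir -p
"$KDUMP_STAMPDIR". With these assumed defaults the worst-case added wait
is 4 sleeps of 3 seconds, i.e. 12 seconds, plus the time each failed
attempt takes.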

** Patch added: "lp1681909_eoan.debdiff"
   
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1681909/+attachment/5275117/+files/lp1681909_eoan.debdiff

** Changed in: makedumpfile (Ubuntu Xenial)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Bionic)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Cosmic)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Disco)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Eoan)
       Status: Confirmed => In Progress

** Changed in: makedumpfile (Ubuntu Disco)
     Assignee: Guilherme G. Piccoli (gpiccoli) => Thadeu Lima de Souza Cascardo 
(cascardo)

** Changed in: makedumpfile (Ubuntu Cosmic)
     Assignee: Guilherme G. Piccoli (gpiccoli) => Thadeu Lima de Souza Cascardo 
(cascardo)

** Changed in: makedumpfile (Ubuntu Bionic)
     Assignee: Guilherme G. Piccoli (gpiccoli) => Thadeu Lima de Souza Cascardo 
(cascardo)

** Changed in: makedumpfile (Ubuntu Xenial)
     Assignee: Guilherme G. Piccoli (gpiccoli) => Thadeu Lima de Souza Cascardo 
(cascardo)

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1681909

Title:
  kdump is not captured in remote host when kdump over ssh is configured

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1681909/+subscriptions

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
