Bug#939328: linux-image-4.19.0-5-amd64: buster and stretch-backports kernel causes interfaces rename back to ethX on HPe DL360g10
Package: src:linux
Version: 4.19.37-5+deb10u2
Severity: normal

Hi,

installing the latest update caused our NICs to be renamed from "enoX" back to "ethX". We first experienced this issue on Debian/buster on HPe DL360g10 and have now found the same issue with the latest upload of the same kernel to stretch-backports.

We're now working around this issue by using

$ cat /etc/systemd/network/101-onboard-rd.link
[Link]
NamePolicy=onboard kernel database slot path
MACAddressPolicy=none

That said, on a system still running Debian/stretch with the old working kernel and without the systemd workaround applied, I see the following:

svenhoexter@docker-018:~$ uname -a
Linux docker-018 4.19.0-0.bpo.5-amd64 #1 SMP Debian 4.19.37-4~bpo9+1 (2019-06-19) x86_64 GNU/Linux

svenhoexter@docker-018:~$ udevadm test-builtin net_id /sys/class/net/eno5 2>/dev/null
ID_NET_NAME_ONBOARD=eno5
ID_NET_NAME_PATH=enp93s0f0

svenhoexter@docker-018:~$ SYSTEMD_LOG_LEVEL=debug udevadm test-builtin net_setup_link /sys/class/net/eno5
calling: test-builtin
=== trie on-disk ===
tool version: 232
file size: 8441068 bytes
header size 80 bytes
strings 1846908 bytes
nodes 6594080 bytes
Load module index
Found container virtualization none
timestamp of '/etc/systemd/network' changed
Skipping overridden file: /lib/systemd/network/99-default.link.
Parsed configuration file /etc/systemd/network/99-default.link
Created link configuration context.
ID_NET_DRIVER=i40e
No matching link configuration found.
Unload module index
Unloaded link configuration context.

On a system with the new kernel and the workaround applied we have:

svenhoexter@docker-019:~$ uname -a
Linux docker-018 4.19.0-5-amd64 #1 SMP Debian 4.19.37-5+deb10u2 (2019-08-08) x86_64 GNU/Linux

svenhoexter@docker-019:~$ udevadm test-builtin net_id /sys/class/net/eno5 2>/dev/null
ID_NET_NAMING_SCHEME=v240
ID_NET_NAME_ONBOARD=eno5
ID_NET_NAME_PATH=enp93s0f0

svenhoexter@docker-019:~$ SYSTEMD_LOG_LEVEL=debug udevadm test-builtin net_setup_link /sys/class/net/eno5
=== trie on-disk ===
tool version: 241
file size: 9492053 bytes
header size 80 bytes
strings 2069269 bytes
nodes 7422704 bytes
Load module index
Failed to read $container of PID 1, ignoring: Permission denied
Found container virtualization none.
timestamp of '/etc/systemd/network' changed
Skipping overridden file '/usr/lib/systemd/network/99-default.link'.
Parsed configuration file /etc/systemd/network/99-default.link
Parsed configuration file /etc/systemd/network/101-onboard-rd.link
Created link configuration context.
ID_NET_DRIVER=i40e
Config file /etc/systemd/network/101-onboard-rd.link applies to device eno5
link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
link_config: could not set ethtool features for eno5
Could not set offload features of eno5: Operation not permitted
eno5: Device has name_assign_type=4
Using default interface naming scheme 'v240'.
eno5: Policy *onboard* yields "eno5".
ID_NET_LINK_FILE=/etc/systemd/network/101-onboard-rd.link
ID_NET_NAME=eno5
Unload module index
Unloaded link configuration context.

Any hint on how to provide additional input is appreciated.

Sven
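To make the effect of the NamePolicy= line in the workaround easier to follow, here is a minimal Python sketch of a first-match selection over the properties that `udevadm test-builtin net_id` printed above. It is an illustration only, not what udev actually runs: the function names are made up, the mapping from policy names to udev properties is my reading of systemd.link(5), and the special "kernel" policy (keep a name the kernel already considers predictable) is deliberately not modelled.

```python
# Hypothetical helper names. The policy-to-property mapping is an assumption
# based on systemd.link(5); the "kernel" policy is skipped in this sketch.
POLICY_TO_PROPERTY = {
    "database": "ID_NET_NAME_FROM_DATABASE",
    "onboard": "ID_NET_NAME_ONBOARD",
    "slot": "ID_NET_NAME_SLOT",
    "path": "ID_NET_NAME_PATH",
    "mac": "ID_NET_NAME_MAC",
}


def parse_properties(output):
    """Turn KEY=VALUE lines (as printed by net_id) into a dict."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key:
            props[key] = value
    return props


def pick_name(name_policy, props):
    """Return the name from the first policy that has a non-empty property."""
    for policy in name_policy.split():
        prop = POLICY_TO_PROPERTY.get(policy)
        if prop and props.get(prop):
            return props[prop]
    return None


# Output copied from the docker-019 system in the report above.
NET_ID_OUTPUT = """\
ID_NET_NAMING_SCHEME=v240
ID_NET_NAME_ONBOARD=eno5
ID_NET_NAME_PATH=enp93s0f0
"""

props = parse_properties(NET_ID_OUTPUT)
print(pick_name("onboard kernel database slot path", props))
```

With "onboard" listed first the sketch picks eno5, which matches the 'Policy *onboard* yields "eno5"' line in the debug output; with only "slot path" it would fall through to the path-based name.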
Bug#903767: Bug#903800: 4.9.110-1 Xen PV boot workaround
The package is available via stretch-proposed-updates. Just add that one to your sources.list until the next point release or linux security update.

HTH,
Sven

On 22 July 2018 at 22:48:35 CEST, Jered Floyd wrote:
>
>It appears that this ticket has been closed, noting a fix in
>linux-4.9.110-2 (source pkg). Will this replace the current
>linux-image-4.9.0-7-amd64 in stretch soon? It's currently making
>stretch unusable with Xen.
>
>--Jered
>
>- On Jul 17, 2018, at 6:23 AM, Hans van Kranenburg h...@knorrie.org
>wrote:
>
>> On 07/17/2018 12:39 AM, Benoît Tonnerre wrote:
>>> Hi,
>>>
>>> I tested this workaround : I confirm that it works on Xen host, but
>>> not on Xen guest.
>>> If you try to start a vm with latest kernel i.e. these parameters
>>> in cfg file :
>>>
>>> #
>>> # Kernel + memory size
>>> #
>>> kernel = '/boot/vmlinuz-4.9.0-7-amd64'
>>> extra = 'elevator=noop'
>>> ramdisk = '/boot/initrd.img-4.9.0-7-amd64'
>>>
>>> The VM crashes in a loop with kernel error :
>>>
>>> [...]
>>>
>>> Did I miss something ?
>>
>> Yes, the pti=off needs to go in your extra line:
>>
>> extra = 'elevator=noop pti=off'
>>
>> Hans
>
>--
>To unsubscribe, send mail to 903800-unsubscr...@bugs.debian.org.
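For anyone who has to apply Hans's suggestion to more than a handful of guests, the following is a rough Python sketch that appends a kernel parameter to the extra line of a guest cfg file. The function name and the regular expression are illustrative assumptions; it only handles the simple single-line extra = '...' form quoted above and should be reviewed before being pointed at real configs.

```python
import re

# Matches a single-line  extra = '...'  or  extra = "..."  assignment.
EXTRA_LINE = re.compile(
    r"^(?P<head>[ \t]*extra[ \t]*=[ \t]*)(?P<quote>['\"])(?P<params>.*?)(?P=quote)",
    re.MULTILINE,
)


def add_kernel_param(cfg_text, param):
    """Return cfg_text with param present in the guest's extra line.

    Hypothetical helper: appends to an existing extra line, or adds a
    new one at the end of the file if none is found.
    """

    def _append(match):
        params = match.group("params").split()
        if param not in params:
            params.append(param)
        quote = match.group("quote")
        return f"{match.group('head')}{quote}{' '.join(params)}{quote}"

    new_text, count = EXTRA_LINE.subn(_append, cfg_text)
    if count == 0:
        new_text = cfg_text.rstrip("\n") + f"\nextra = '{param}'\n"
    return new_text


# The three lines quoted from the guest config in this thread.
SAMPLE_CFG = (
    "kernel = '/boot/vmlinuz-4.9.0-7-amd64'\n"
    "extra = 'elevator=noop'\n"
    "ramdisk = '/boot/initrd.img-4.9.0-7-amd64'\n"
)

print(add_kernel_param(SAMPLE_CFG, "pti=off"))
```

The helper is idempotent, so running it twice over the same file does not add pti=off a second time.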
Bug#670398: Deadlock in hid_reset when Dell iDRAC is reset
On Sun, Jul 15, 2012 at 11:41:33PM +0100, Ben Hutchings wrote:

Hi,

> I assume you mean this patch:
> http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=65;filename=0001-usb-Fix-deadlock-in-hid_reset-when-Dell-iDRAC.patch;att=1;bug=670398
> so I'll apply that.

Exactly, that would be great.

> It won't be accepted into a 2.6.32.y release unless someone can explain
> how it was fixed upstream (ideally, which commit(s) fixed it).

I think it was mentioned somewhere that it got fixed with some USB-HID rewrite in 2.6.36 or 2.6.37. We could not reproduce it with Linux 3.2 from backports or with internal builds of 2.6.37. But I can see that this isn't a proper explanation or reason for an inclusion upstream.

Sven

--
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20120716142519.gc23...@sho.bk.hosteurope.de
Bug#670398: Deadlock in hid_reset when Dell iDRAC is reset
On Mon, May 21, 2012 at 10:25:09PM +0530, shyam_i...@dell.com wrote:

Hi,

> We have observed that doing a reset on idrac on low-end server like
> R|T210 R|T310 triggers the panic whereas the high end servers do not
> deadlock on an iDRAC reset so we know that this is timing dependent.

Ah thanks, that matches our observations.

> Ben - I had attached the patch to the earlier thread. Let me know if
> you need any additional work from me on this.

We've now applied that patch to the latest Debian Squeeze kernel release and it indeed fixes the 'racreset' issue.

Ben, is there a chance to get that one included in the Debian kernel or, even better, in a 2.6.32.x release upstream? Since we see the same issue with Ubuntu 10.04 I'll have to open a bug report with them as well.

Sven

Archive: http://lists.debian.org/20120529074240.ga2...@sho.bk.hosteurope.de
Bug#670398: Deadlock in hid_reset when Dell iDRAC is reset
On Tue, May 01, 2012 at 10:15:37AM +0530, shyam_i...@dell.com wrote:

Hi,

> Was the usb reset issue found while resetting the iDRAC ?

Resetting the iDRAC is an out-of-band process and has to be issued via a separate management network to the iDRAC.

I found the time to test this issue in several OS-hardware combinations:

R210 Squeeze - hangs
R210 Ubuntu 10.04 - recovers (to my surprise)
R210 II Squeeze - hangs
R210 II Ubuntu 10.04 - hangs
R210 II CentOS 6.1 - hangs (expected, just tried that to be sure)
R710 Squeeze - not affected

Sven

Archive: http://lists.debian.org/20120515094541.ga3...@sho.bk.hosteurope.de
Bug#670398: Deadlock in hid_reset when Dell iDRAC is reset
On Tue, May 01, 2012 at 10:15:37AM +0530, shyam_i...@dell.com wrote:

Hi,

> It doesn't seem like this is the same bug. Was the usb reset issue
> found while resetting the iDRAC ?

Ok, we just tried a 'racadm racreset hard' on an R210 and yes, we can reproduce that issue. We would highly appreciate getting the fix for that issue included in the Debian/squeeze kernel as well. Maybe it would fix our original issue as well, but that needs to be tested.

I've just requested some test hardware at my workplace to conduct further testing.

Sven

Archive: http://lists.debian.org/20120502103151.ga18...@sho.bk.hosteurope.de
Bug#670398: Deadlock in hid_reset when Dell iDRAC is reset
On Tue, May 01, 2012 at 10:15:37AM +0530, shyam_i...@dell.com wrote:

Hi,

> It doesn't seem like this is the same bug. Was the usb reset issue
> found while resetting the iDRAC ?

No, during normal operation. I think nobody even used the iDRAC of those systems between the last boot and the appearance of this issue. While we had to use 'racreset hard' rather frequently with the old DRAC 4 cards, I can't really remember that we had to use it at all with the current iDRAC cards in R210 and R210-II based systems.

To me it still looks like this could be a symptomatic log of this bug:

  BZ#772884
  On large SMP systems, the TSC (Time Stamp Counter) clock frequency could be incorrectly calculated. The discrepancy between the correct value and the incorrect value was within 0.5%. When the system rebooted, this small error would result in the system becoming out of synchronization with an external reference clock (typically a NTP server). With this update, the TSC frequency calculation has been improved and the clock correctly maintains synchronization with external reference clocks.

I'm not sure what counts as 'large SMP system' here. The systems we see this mostly on are R210 with an Intel X3430 CPU. Last week we had a first appearance of this issue on an R210-II system equipped with an E3-1220 CPU. They're all quad-core single-socket systems.

We are using ntpd in the default installation, so it should've been involved on all systems.

Sven

--
And I don't know much, but I do know this: With a golden heart comes a rebel fist.
[ Streetlight Manifesto - Here's To Life ]

Archive: http://lists.debian.org/20120501083747.GA2990@marvin
Bug#670398: linux-image-2.6.32-5-amd64: SSH logins hang while hpet interrupts multiply on Intel Nehalem CPUs
On Fri, Apr 27, 2012 at 04:19:21AM +0100, Ben Hutchings wrote:

Hi,

> So it looks like in this case at least you're seeing a bug in USB error
> recovery and not anything to do with timing using the TSC vs HPET.

Ok, a few minutes ago I became aware of another system with the given symptoms, with a SandyBridge CPU (E3-1220) and the same call trace.

I'm now wondering what I could do to help debug this issue further.

All those Dell systems (R210 and R210 II) are equipped with DRAC KVM cards. They expose the input of the KVM Java applet as USB devices to the system. For at least one of the affected systems I'm pretty sure that nobody tried to use the KVM applet when the issue appeared, but that doesn't mean anything.

External USB devices were not connected to the systems.

Kernel is linux-image-2.6.32-5-amd64 2.6.32-38

Sven

Archive: http://lists.debian.org/20120427133047.ga12...@sho.bk.hosteurope.de
Bug#670398: linux-image-2.6.32-5-amd64: SSH logins hang while hpet interrupts multiply on Intel Nehalem CPUs
On Thu, Apr 26, 2012 at 04:49:56AM +0100, Ben Hutchings wrote:
> On Wed, 2012-04-25 at 10:36 +0200, Sven Hoexter wrote:

Hi,

> > Searching through munin graphs we could narrow down the starting
> > point of this issue to the point when the hpet interrupts for one CPU
> > core multiplied. Sometimes they multiplied by six. Looking further
> > we've found the Kernel [events/$x] in state D where $x is the number
> > of the CPU core which has the high number of hpet interrupts.
> >
> > When we started strace -f on the sshd master process everything works
> > until you logout. Then you'll again see the forked sshd process
> > hanging in state D.
>
> This is strange, because D state means uninterruptible sleep (not
> handling signals). But perhaps the sshd process was repeatedly changing
> between uninterruptible and interruptible state.

Is it possible to gather such data? I guess grep'ing through ps output is not the right tool here.

From a system currently suffering from this issue:

ps aux|grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        15  0.0  0.0      0     0 ?        D    Apr25   0:53 [events/0]
root      4162  0.0  0.0      0     0 ?        Ds   08:33   0:00 [bash]
480       7875  0.0  0.0      0     0 ?        Ds   09:28   0:00 [bash]
root      9407  0.0  0.0  76644  3392 ?        Ds   09:49   0:00 sshd: root@pts/79
480      11310  0.0  0.0   8940   884 ?        S    09:59   0:00 grep D
480      11765  0.0  0.0      0     0 ?        Ds   Apr25   0:00 [bash]
root     12803  0.0  0.0  76644  3392 ?        Ds   Apr25   0:00 sshd: root@pts/12
root     13762  0.0  0.0  76644  3392 ?        Ds   Apr25   0:00 sshd: root@pts/73
root     15111  0.0  0.0      0     0 ?        Ds   Apr25   0:00 [bash]
root     19361  0.0  0.0      0     0 ?        Ds   Apr25   0:00 [bash]
root     20966  0.0  0.0      0     0 ?        Ds   Apr25   0:00 [bash]
root     29323  0.0  0.0      0     0 ?        Ds   Apr25   0:00 [bash]

Sven

Archive: http://lists.debian.org/20120426080254.ga4...@sho.bk.hosteurope.de
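Since `grep D` also matches the header line and the grep process itself, a slightly more targeted filter may be useful for anyone collecting the same data. The following Python sketch selects rows by the STAT column instead. It is only an illustration: the function name is made up, and it assumes the usual `ps aux` column layout in which no field before COMMAND contains whitespace.

```python
def blocked_tasks(ps_output):
    """Return (pid, stat, command) for every task whose STAT starts with 'D'.

    Hypothetical helper; expects text in `ps aux` format including the
    header line, and locates the columns by their header names.
    """
    lines = ps_output.strip().splitlines()
    header = lines[0].split()
    pid_col = header.index("PID")
    stat_col = header.index("STAT")
    cmd_col = header.index("COMMAND")

    result = []
    for line in lines[1:]:
        # Split only up to COMMAND so that commands with spaces stay intact.
        fields = line.split(None, cmd_col)
        if len(fields) <= cmd_col:
            continue
        if fields[stat_col].startswith("D"):
            result.append((int(fields[pid_col]), fields[stat_col], fields[cmd_col]))
    return result


# A few rows taken from the listing above.
SAMPLE = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 15 0.0 0.0 0 0 ? D Apr25 0:53 [events/0]
root 9407 0.0 0.0 76644 3392 ? Ds 09:49 0:00 sshd: root@pts/79
480 11310 0.0 0.0 8940 884 ? S 09:59 0:00 grep D
"""

for pid, stat, command in blocked_tasks(SAMPLE):
    print(pid, stat, command)
```

On a live system the input could come from `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout`. Note that this still only samples the state at one instant, so it would have to be run in a loop to catch a task flipping between D and S.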
Bug#670398: linux-image-2.6.32-5-amd64: SSH logins hang while hpet interrupts multiply on Intel Nehalem CPUs
On Wed, Apr 25, 2012 at 09:33:44PM -0700, John Stultz wrote:

Hi,

> When you can connect to the system that is having problems, do you see
> any problems with the time? ie: does date show the correct time, and
> does it increment normally?

I don't see any jumps in time here:

while true; do /sbin/hwclock --show; date; done
Thu 26 Apr 2012 09:50:19 AM CEST -0.781566 seconds
Thu Apr 26 09:50:18 CEST 2012
Thu 26 Apr 2012 09:50:20 AM CEST -1.000351 seconds
Thu Apr 26 09:50:19 CEST 2012
Thu 26 Apr 2012 09:50:21 AM CEST -1.000321 seconds
Thu Apr 26 09:50:20 CEST 2012
Thu 26 Apr 2012 09:50:22 AM CEST -1.000299 seconds
Thu Apr 26 09:50:21 CEST 2012
Thu 26 Apr 2012 09:50:23 AM CEST -1.000342 seconds
Thu Apr 26 09:50:22 CEST 2012
Thu 26 Apr 2012 09:50:24 AM CEST -1.000315 seconds
Thu Apr 26 09:50:23 CEST 2012
Thu 26 Apr 2012 09:50:25 AM CEST -0.984703 seconds
Thu Apr 26 09:50:24 CEST 2012
Thu 26 Apr 2012 09:50:26 AM CEST -1.000319 seconds
Thu Apr 26 09:50:25 CEST 2012
Thu 26 Apr 2012 09:50:27 AM CEST -1.000324 seconds
Thu Apr 26 09:50:26 CEST 2012
Thu 26 Apr 2012 09:50:28 AM CEST -1.000320 seconds
Thu Apr 26 09:50:27 CEST 2012
Thu 26 Apr 2012 09:50:29 AM CEST -1.000326 seconds
Thu Apr 26 09:50:28 CEST 2012
Thu 26 Apr 2012 09:50:30 AM CEST -1.000307 seconds
Thu Apr 26 09:50:29 CEST 2012
Thu 26 Apr 2012 09:50:31 AM CEST -1.000337 seconds
Thu Apr 26 09:50:30 CEST 2012
Thu 26 Apr 2012 09:50:32 AM CEST -1.004597 seconds
Thu Apr 26 09:50:32 CEST 2012
Thu 26 Apr 2012 09:50:33 AM CEST -0.984694 seconds
Thu Apr 26 09:50:32 CEST 2012
Thu 26 Apr 2012 09:50:34 AM CEST -1.000320 seconds
Thu Apr 26 09:50:33 CEST 2012
Thu 26 Apr 2012 09:50:35 AM CEST -1.000321 seconds
Thu Apr 26 09:50:34 CEST 2012
Thu 26 Apr 2012 09:50:36 AM CEST -1.000320 seconds
Thu Apr 26 09:50:35 CEST 2012
Thu 26 Apr 2012 09:50:37 AM CEST -1.000321 seconds
Thu Apr 26 09:50:36 CEST 2012
Thu 26 Apr 2012 09:50:38 AM CEST -1.004611 seconds
Thu Apr 26 09:50:38 CEST 2012
Thu 26 Apr 2012 09:50:39 AM CEST -0.984718 seconds
Thu Apr 26 09:50:38 CEST 2012
Thu 26 Apr 2012 09:50:40 AM CEST -1.000274 seconds
Thu Apr 26 09:50:39 CEST 2012
Thu 26 Apr 2012 09:50:41 AM CEST -1.000315 seconds

> It sounds like if there is some HPET irq issue, it would likely be due
> to some sort of global wakeup to handle local apics that halt in deep
> sleep modes. It's likely that getting /proc/timer_list output would
> help (both before and after the problem).

I've attached /proc/timer_list from a system that is currently suffering from this problem. Unfortunately I don't have the state from before. I can only grab it from a system that is currently not affected but suffered in the past, if that helps?

Regards,
Sven

Timer List Version: v0.5
HRTIMER_MAX_CLOCK_BASES: 2
now at 83615781793881 nsecs

cpu: 0
 clock 0:
  .base: 880008a100c8
  .index: 0
  .resolution: 1 nsecs
  .get_time: ktime_get_real
  .offset: 1335342613769994271 nsecs
active timers:
 #0: 8801a0967d08, hrtimer_wakeup, S:01, futex_wait_queue_me, java/18099
 # expires at 1335426229557084000-1335426229557134000 nsecs [in 1335342613775290119 to 1335342613775340119 nsecs]
 #1: 88023deb7d08, hrtimer_wakeup, S:01, futex_wait_queue_me, java/18106
 # expires at 1335426254026767000-1335426254026817000 nsecs [in 1335342638244973119 to 1335342638245023119 nsecs]
 #2: 8801f184dd08, hrtimer_wakeup, S:01, futex_wait_queue_me, java/18105
 # expires at 1335426270572844000-1335426270572894000 nsecs [in 1335342654791050119 to 1335342654791100119 nsecs]
 #3: 8801f3d9bd08, hrtimer_wakeup, S:01, futex_wait_queue_me, java/18110
 # expires at 1335426275040889000-1335426275040939000 nsecs [in 1335342659259095119 to 1335342659259145119 nsecs]
 #4: 8801a097fd08, hrtimer_wakeup, S:01, futex_wait_queue_me, java/18111
 # expires at 133542628422666-133542628422671 nsecs [in 1335342668444866119 to 1335342668444916119 nsecs]
 clock 1:
  .base: 880008a10108
  .index: 1
  .resolution: 1 nsecs
  .get_time: ktime_get
  .offset: 0 nsecs
active timers:
 #0: 880008a101b0, tick_sched_timer, S:01, tick_nohz_restart_sched_tick, swapper/0
 # expires at 8361578400-8361578400 nsecs [in 2206119 to 2206119 nsecs]
 #1: 88021ebbb968, hrtimer_wakeup, S:01, schedule_hrtimeout_range, init/5900
 # expires at 83616008255524-83616013255522 nsecs [in 226461643 to 231461641 nsecs]
 #2: 8802181f1968, hrtimer_wakeup, S:01, schedule_hrtimeout_range, init/6795
 # expires at 83616584188235-83616589188233 nsecs [in 802394354 to 807394352 nsecs]
 #3: 88021d9af968, hrtimer_wakeup, S:01, schedule_hrtimeout_range, init/6032
 # expires at 83616708138883-83616713138881 nsecs [in 926345002 to 931345000 nsecs]
 #4: 88023030d968, hrtimer_wakeup, S:01, schedule_hrtimeout_range, mysqld/4211
 # expires at 83616968110539-83616973110538 nsecs [in 1186316658 to 1191316657 nsecs]
 #5: 880215f7b968, hrtimer_wakeup, S:01, schedule_hrtimeout_range, init/7172
 # expires at
Bug#670398: linux-image-2.6.32-5-amd64: SSH logins hang while hpet interrupts multiply on Intel Nehalem CPUs
On Thu, Apr 26, 2012 at 01:45:30PM +0100, Ben Hutchings wrote:

Hi,

> You can use 'echo w > /proc/sysrq-trigger' to get a traceback for all
> the tasks in D state, which might provide some clues.

Ok, see the attached file.

Regards,
Sven

Apr 26 16:08:34 vdf1 kernel: [6726714.281854] SysRq : Show Blocked State
Apr 26 16:08:34 vdf1 kernel: [6726714.281883] task PC stack pid father
Apr 26 16:08:34 vdf1 kernel: [6726714.281912] events/0 D 015 2 0x
Apr 26 16:08:34 vdf1 kernel: [6726714.281946] 814891f0 0046 88043e4e5c0c
Apr 26 16:08:34 vdf1 kernel: [6726714.281997] 88000fa15780 f9e0 88043e4e5fd8
Apr 26 16:08:34 vdf1 kernel: [6726714.282048] 00015780 00015780 88043e4e8000 88043e4e82f8
Apr 26 16:08:34 vdf1 kernel: [6726714.282099] Call Trace:
Apr 26 16:08:34 vdf1 kernel: [6726714.282127] [8105af0e] ? __mod_timer+0x141/0x153
Apr 26 16:08:34 vdf1 kernel: [6726714.287558] [8105a9f4] ? try_to_del_timer_sync+0x63/0x6c
Apr 26 16:08:34 vdf1 kernel: [6726714.287589] [812fbd24] ? schedule_timeout+0xa5/0xdd
Apr 26 16:08:34 vdf1 kernel: [6726714.287617] [8105aa88] ? process_timeout+0x0/0x5
Apr 26 16:08:34 vdf1 kernel: [6726714.287651] [a00fc8f7] ? ehci_endpoint_disable+0xa4/0x141 [ehci_hcd]
Apr 26 16:08:34 vdf1 kernel: [6726714.287699] [a00b8f0d] ? usb_ep0_reinit+0x13/0x34 [usbcore]
Apr 26 16:08:34 vdf1 kernel: [6726714.287731] [a00b970a] ? usb_reset_and_verify_device+0x87/0x3d6 [usbcore]
Apr 26 16:08:34 vdf1 kernel: [6726714.287779] [a00be1c7] ? usb_kill_urb+0x10/0xbb [usbcore]
Apr 26 16:08:34 vdf1 kernel: [6726714.287811] [a00be1c7] ? usb_kill_urb+0x10/0xbb [usbcore]
Apr 26 16:08:34 vdf1 kernel: [6726714.287842] [a00b9aed] ? usb_reset_device+0x94/0x124 [usbcore]
Apr 26 16:08:34 vdf1 kernel: [6726714.287873] [a00d96a0] ? hid_reset+0x91/0x122 [usbhid]
Apr 26 16:08:34 vdf1 kernel: [6726714.287903] [810619f7] ? worker_thread+0x188/0x21d
Apr 26 16:08:34 vdf1 kernel: [6726714.287932] [a00d960f] ? hid_reset+0x0/0x122 [usbhid]
Apr 26 16:08:34 vdf1 kernel: [6726714.287961] [8106502a] ? autoremove_wake_function+0x0/0x2e
Apr 26 16:08:34 vdf1 kernel: [6726714.287990] [8106186f] ? worker_thread+0x0/0x21d
Apr 26 16:08:34 vdf1 kernel: [6726714.288018] [81064d5d] ? kthread+0x79/0x81
Apr 26 16:08:34 vdf1 kernel: [6726714.288046] [81011baa] ? child_rip+0xa/0x20
Apr 26 16:08:34 vdf1 kernel: [6726714.288072] [81064ce4] ? kthread+0x0/0x81
Apr 26 16:08:34 vdf1 kernel: [6726714.288098] [81011ba0] ? child_rip+0x0/0x20
Apr 26 16:08:34 vdf1 kernel: [6726714.288131] sshd D 0 17858 17854 0x
Apr 26 16:08:34 vdf1 kernel: [6726714.288164] 88043e473880 0086 0ffc
Apr 26 16:08:34 vdf1 kernel: [6726714.288215] f9e0 880270a55fd8
Apr 26 16:08:34 vdf1 kernel: [6726714.288266] 00015780 00015780 88043c100710 88043c100a08
Apr 26 16:08:34 vdf1 kernel: [6726714.288316] Call Trace:
Apr 26 16:08:34 vdf1 kernel: [6726714.288340] [8100f6c4] ? __switch_to+0x1ad/0x297
Apr 26 16:08:34 vdf1 kernel: [6726714.288367] [812fbcad] ? schedule_timeout+0x2e/0xdd
Apr 26 16:08:34 vdf1 kernel: [6726714.288397] [810482ed] ? finish_task_switch+0x3a/0xaf
Apr 26 16:08:34 vdf1 kernel: [6726714.288426] [812fb8f0] ? thread_return+0x79/0xe0
Apr 26 16:08:34 vdf1 kernel: [6726714.288453] [812fbb64] ? wait_for_common+0xde/0x15b
Apr 26 16:08:34 vdf1 kernel: [6726714.288482] [8104a4cc] ? default_wake_function+0x0/0x9
Apr 26 16:08:34 vdf1 kernel: [6726714.288510] [81062326] ? flush_work+0x75/0x87
Apr 26 16:08:34 vdf1 kernel: [6726714.288538] [81061d00] ? wq_barrier_func+0x0/0x9
Apr 26 16:08:34 vdf1 kernel: [6726714.288567] [811fbd44] ? n_tty_poll+0x5e/0x138
Apr 26 16:08:34 vdf1 kernel: [6726714.288594] [811f8732] ? tty_poll+0x56/0x6d
Apr 26 16:08:34 vdf1 kernel: [6726714.288622] [810fc782] ? do_select+0x37b/0x57a
Apr 26 16:08:34 vdf1 kernel: [6726714.288650] [810fcdfb] ? __pollwait+0x0/0xd6
Apr 26 16:08:34 vdf1 kernel: [6726714.288677] [810fced1] ? pollwake+0x0/0x5b
Apr 26 16:08:34 vdf1 kernel: [6726714.288704] [810fced1] ? pollwake+0x0/0x5b
Apr 26 16:08:34 vdf1 kernel: [6726714.288731] [810fced1] ? pollwake+0x0/0x5b
Apr 26 16:08:34 vdf1 kernel: [6726714.288757] [810fced1] ? pollwake+0x0/0x5b
Apr 26 16:08:34 vdf1 kernel: [6726714.288785] [8127f29b] ? tcp_recvmsg+0x98b/0xa9e
Apr 26 16:08:34 vdf1 kernel: [6726714.288813] [8103fc9e] ? update_curr+0xa6/0x147
Apr 26 16:08:34 vdf1 kernel:
Bug#670398: linux-image-2.6.32-5-amd64: SSH logins hang while hpet interrupts multiply on Intel Nehalem CPUs
Package: linux-image-2.6.32-5-amd64
Version: 2.6.32-41squeeze2
Severity: important

Hi,

since about December 2011 we've seen systems where SSH sessions suddenly hang and further logins on the physical TTY or via SSH are no longer possible. In some cases SSH logins still work, you see the motd and can maybe even issue one or two commands.

(I brought this issue up on debian-user in March and received a private reply from a fellow DD yesterday: http://lists.debian.org/debian-user/2012/03/msg01204.html)

Over time we observed that SSH logins without PTS (ssh -T) still work. Looking at other sessions, sshd was in state D and entries in /dev/pts/ were created correctly.

Searching through munin graphs we could narrow down the starting point of this issue to the point when the hpet interrupts for one CPU core multiplied. Sometimes they multiplied by six. Looking further we've found the Kernel [events/$x] in state D where $x is the number of the CPU core which has the high number of hpet interrupts.

When we started strace -f on the sshd master process everything works until you logout. Then you'll again see the forked sshd process hanging in state D.

So far we've seen this issue exclusively on Linux 2.6.32 based systems, most often on Debian/Squeeze, less often on Ubuntu 10.04 and once or twice on a RHEL 6.1 system.

Searching further I've seen posts on a Dell PowerEdge mailing list referencing RedHat BZ#750201 and Intel CPU erratum AAO67 for Nehalem (rapid C state switching). The RedHat bug is currently non-public, but through our technical contact at RedHat I was able to receive a summary of this bug and other referenced bugs which describe our issue more or less exactly.

According to RedHat that should be fixed in their Kernel 2.6.32-220.7.1.el6, citing the following in the changelog:

- [x86] hpet: Disable per-cpu hpet timer if ARAT is supported (Prarit Bhargava) [772884 750201]
- [x86] Improve TSC calibration using a delayed workqueue (Prarit Bhargava) [772884 750201]
- [kernel] clocksource: Add clocksource_register_hz/khz interface (Prarit Bhargava) [772884 750201]
- [kernel] clocksource: Provide a generic mult/shift factor calculation (Prarit Bhargava) [772884 750201]

(Maybe that helps to track down the relevant changes.)

As a workaround it could work to disable C-states in the BIOS or on the Kernel commandline with intel_idle.max_cstate=0 processor.max_cstate=1. Since we run into that issue only from time to time on the same system we could not yet verify either workaround. Rumours indicate that sometimes disabling it in the BIOS did not help because the Kernel enabled C-states again.

My current guess is that it's somehow related to the Intel Nehalem CPU bug and only happens if you have a high single-threaded load, which leads to one or more cores being switched into a C-6 sleep state so that one core can be overclocked. The marketing name is TurboBoost.

Regarding the CPUs I know this happens with:
- Intel X3430
- Intel X3450
- Intel L3426

We see it in almost all cases on Dell R210 with the X3430 CPUs. Rumours claim it also happens with other Dell models based on other CPUs from the Intel Nehalem series with TurboBoost.

It would be great if someone could track down the needed changes and incorporate those into a point release. In general I would be available for testing, but we still have no way to reproduce it besides waiting a few months. :(

Regards,
Sven

-- System Information:
Debian Release: 6.0.4
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 3.2.0-0.bpo.1-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages linux-image-2.6.32-5-amd64 depends on:
ii  debconf [debconf-2.0]         1.5.36.1      Debian configuration management sy
ii  initramfs-tools [linux-init   0.99~bpo60+1  tools for generating an initramfs
ii  linux-base                    3.4~bpo60+1   Linux image base package
ii  module-init-tools             3.12-2        tools for managing Linux kernel mo

Versions of packages linux-image-2.6.32-5-amd64 recommends:
pn  firmware-linux-free           none          (no description available)

Versions of packages linux-image-2.6.32-5-amd64 suggests:
pn  grub | lilo                   none          (no description available)
pn  linux-doc-2.6.32              none          (no description available)

Archive: http://lists.debian.org/20120425083611.ga4...@sho.bk.hosteurope.de
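Because the report mentions that C-states may end up enabled again despite the BIOS setting, it can help to confirm that the command-line workaround is really active on a booted system. Below is a small Python sketch that checks a kernel command line for the two parameters named above. The function name and the sample command line are hypothetical; only the two parameter strings come from the report.

```python
# The two parameters suggested as a workaround in the report.
WORKAROUND_PARAMS = ["intel_idle.max_cstate=0", "processor.max_cstate=1"]


def missing_boot_params(cmdline, wanted):
    """Return the entries of wanted that do not appear in cmdline.

    Hypothetical helper: cmdline is the whitespace-separated kernel
    command line, e.g. the contents of /proc/cmdline.
    """
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


# Made-up command line for illustration; on a real system use
#   open("/proc/cmdline").read()
sample_cmdline = (
    "BOOT_IMAGE=/vmlinuz-2.6.32-5-amd64 root=/dev/sda1 ro quiet "
    "processor.max_cstate=1"
)

print(missing_boot_params(sample_cmdline, WORKAROUND_PARAMS))
```

An empty list means both parameters were passed at boot. This only shows what was requested on the command line, not which C-states the CPU actually enters afterwards.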