Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads
Hi,

Quoting Ben Hutchings b...@decadent.org.uk:
> This is not exactly a crash, though I realise the effects are often just as bad as a crash.

Yep, that's not really a crash. It can take some hours for the host to become unresponsive due to the high load average (ssh and local login go down). It's as if the system is waiting for the file system.

> Please send the kernel logs showing the "blocked for more than 120 seconds" messages and the following function call traces.

I will post them as soon as one of my hosts hangs. I do not yet know how to reproduce this bug on demand.

Regards.

-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads
> This bug report is about systems that are *not* making use of SCHED_IDLE, which are probably not due to the known scheduler bug. This particular error message typically indicates a lock ordering bug or infinite delay while holding a lock, and can be caused by any kernel component. One such case was fixed in version 2.6.26-16:
>
> * Fix soft lockups caused by one md resync blocking on another due to sharing the same device (closes: #514627)
>
> Ben.

Hi,

Unfortunately, this fix did not solve the problem. I booted the latest kernel revision, 2.6.26-19-xen, yesterday, and it crashed one hour later with the "blocked for more than 120 seconds" message.

I use Xen with a SAN backend (4 FC links / QLogic ISP2432 card) using multipathd. It happens on Intel 54XX and 55XX processors, with or without booting from SAN, and at random. SADC was the most common process triggering this bug.

If I understand correctly, this may be related to bug #517449?

Regards.
Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads
On Mon, 2009-09-07 at 09:25 +0200, Darck wrote:
> > This bug report is about systems that are *not* making use of SCHED_IDLE, which are probably not due to the known scheduler bug. This particular error message typically indicates a lock ordering bug or infinite delay while holding a lock, and can be caused by any kernel component. One such case was fixed in version 2.6.26-16:
> >
> > * Fix soft lockups caused by one md resync blocking on another due to sharing the same device (closes: #514627)
> >
> > Ben.
>
> Hi,
>
> Unfortunately, this fix did not solve the problem. I booted the latest kernel revision, 2.6.26-19-xen, yesterday, and it crashed one hour later with the "blocked for more than 120 seconds" message.

This is not exactly a crash, though I realise the effects are often just as bad as a crash.

> I use Xen with a SAN backend (4 FC links / QLogic ISP2432 card) using multipathd. It happens on Intel 54XX and 55XX processors, with or without booting from SAN, and at random. SADC was the most common process triggering this bug.
>
> If I understand correctly, this may be related to bug #517449?

Could be. Please send the kernel logs showing the "blocked for more than 120 seconds" messages and the following function call traces.

Ben.

-- 
Ben Hutchings
Life is what happens to you while you're busy making other plans. - John Lennon
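For anyone gathering the logs requested above, the following is a minimal sketch (not part of the original thread) of one way to pull the hung-task warnings and the lines that follow them out of a saved kernel log. The function name `extract_hung_task_reports` and the `context` parameter are hypothetical, and the assumption that the call trace fits within a fixed number of following lines is a simplification; adjust `context` to suit the actual log.

```python
import re

# Matches the hung-task warning quoted in the bug title. The task name and
# the timeout vary between reports, so both are captured, not hard-coded.
HUNG_TASK = re.compile(r"INFO: task (.+?) blocked for more than (\d+) seconds")


def extract_hung_task_reports(lines, context=30):
    """Return a list of reports, one per warning found.

    Each report is the warning line followed by up to `context` subsequent
    lines, which are assumed (not verified) to contain the call trace.
    """
    reports = []
    current = None
    remaining = 0
    for line in lines:
        line = line.rstrip("\n")
        if HUNG_TASK.search(line):
            # A new warning starts a new report, even mid-trace.
            current = [line]
            reports.append(current)
            remaining = context
        elif current is not None and remaining > 0:
            current.append(line)
            remaining -= 1
    return reports
```

The function accepts any iterable of lines, so an open file object for a saved kernel log (or the captured output of `dmesg`) can be passed directly, and each returned report can be pasted into a reply to the bug.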
Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads
On Thu, 2009-09-03 at 10:14 +0200, Darck wrote:
> Hi,
>
> Still no news about this problem?
>
> While upgrading to a newer (unstable) kernel is easy with kernels provided by Debian, this is not the case with the Xen support. The 2.6.26-2 xen release is NOT usable in a production environment, and keeps crashing the host due to this scheduler bug.

This bug report is about systems that are *not* making use of SCHED_IDLE, which are probably not due to the known scheduler bug. This particular error message typically indicates a lock ordering bug or infinite delay while holding a lock, and can be caused by any kernel component. One such case was fixed in version 2.6.26-16:

* Fix soft lockups caused by one md resync blocking on another due to sharing the same device (closes: #514627)

Ben.

-- 
Ben Hutchings
Life is what happens to you while you're busy making other plans. - John Lennon
Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads
Hi,

Still no news about this problem?

While upgrading to a newer (unstable) kernel is easy with kernels provided by Debian, this is not the case with the Xen support. The 2.6.26-2 xen release is NOT usable in a production environment, and keeps crashing the host due to this scheduler bug.

Is there at least a known workaround, other than building one's own Xen kernel or switching to another distro that has fixed this bug?

There are people ready to test patches to help fix this bug, but we need the Debian team to really care about it... At the moment, this is not the case :(

Regards.
Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads
Hi Ben

Ben Hutchings ben-at-decadent.org.uk |DebianBug| wrote:
> On Fri, 2009-06-26 at 14:15 +1200, Brendon Green wrote:
> > I could rebuild the 2.6.26 host and guest kernels for the system in question (I archive the .config files using a local version number which, unfortunately, bears little or no relation to the kernel version number). However, I have also made recent changes to the hardware (PCI PATA card to boost disk from ATA/33 to ATA/133). If you don't think the hardware changes will skew the results, I can temporarily downgrade the kernels, apply the patch (to 2.6.26, regrettably I no longer have Debian's 2.6.28 sources available), and try to reproduce the problem.
>
> Please apply this patch to 2.6.26. We want to fix the bug in lenny (2.6.26) and we don't care about .28 any more. I don't think this is hardware-dependent, so don't bother changing your hardware back.
>
> Ben.

Today, I started playing around with unpatched .26 kernels as host and guest on my server. Regrettably, I have yet to reproduce the bug, although the system *feels* a lot slower with .26 as host _and_ guest. I may yet try booting with the IDE in PIO mode, to artificially degrade CPU and disk performance, and see what happens.

On a different note, I applied your patch to my desktop machine (still running 2.6.26, despite being a mix of testing/unstable/experimental) a few weeks ago, in an attempt to improve OpenOffice.org performance. Since doing that, I have been noticing X freezing (mouse moves, but no hot-tracking, and unable to switch tasks or VTs) when OOo, or another application, is busy. Usually, when OOo is busy, I will switch to another task (or IceWeasel ;-) and continue working.

I'll give more feedback if and when I manage to tickle the bug on my server.

Cheers,
Brendon Green
Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads
On Mon, Jun 08, 2009 at 07:14:44PM -0400, John Morrissey wrote:
> Thanks, Ben. Rebuilt with this patch and threw the resulting kernel on a couple of machines running several KVM VMs. I'll be able to provide confident feedback in a couple of days.

These machines have been stable since running kernels including this patch.

john
-- 
John Morrissey
j...@horde.net
www.horde.net/
Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads
On Thu, Jun 11, 2009 at 11:53:03AM -0400, John Morrissey wrote:
> On Mon, Jun 08, 2009 at 07:14:44PM -0400, John Morrissey wrote:
> > Thanks, Ben. Rebuilt with this patch and threw the resulting kernel on a couple of machines running several KVM VMs. I'll be able to provide confident feedback in a couple of days.
>
> These machines have been stable since running kernels including this patch.

Awesome, thanks for the feedback!

-- 
dann frazier
Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads
On Fri, Jun 05, 2009 at 03:37:33AM +0100, Ben Hutchings wrote:
> Please try the patch I posted here:
> http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=54;bug=517449
> It includes a fix made between 2.6.26 and .28 that may address this bug.
>
> Ben.

Thanks, Ben. Rebuilt with this patch and threw the resulting kernel on a couple of machines running several KVM VMs. I'll be able to provide confident feedback in a couple of days.

john
-- 
John Morrissey
j...@horde.net
www.horde.net/
Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads
Please try the patch I posted here:
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=54;bug=517449
It includes a fix made between 2.6.26 and .28 that may address this bug.

Ben.

-- 
Ben Hutchings
Logic doesn't apply to the real world. - Marvin Minsky