Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads

2009-09-08 Thread Darck

Hi,

Quoting Ben Hutchings b...@decadent.org.uk:


> This is not exactly a crash, though I realise the effects are often just
> as bad as a crash.


Yep, that's not really a crash. It can take some hours for the host to
become unresponsive due to the high load average (ssh and local login
both stop working). It's as if the system is waiting on the file system.



> Please send the kernel logs showing the "blocked for more than 120
> seconds" messages and the function call traces that follow them.


I will post them as soon as one of my hosts hangs. I do not yet know how
to reproduce this bug on demand.


Regards.




--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads

2009-09-07 Thread Darck

> This bug report is about systems that are *not* making use of
> SCHED_IDLE, which are probably not due to the known scheduler bug.  This
> particular error message typically indicates a lock ordering bug or
> infinite delay while holding a lock, and can be caused by any kernel
> component.
>
> One such case was fixed in version 2.6.26-16:
>
>  * Fix soft lockups caused by one md resync blocking on another due
>    to sharing the same device (closes: #514627)
>
> Ben.


Hi,

Unfortunately, this fix did not solve the problem. I booted the
latest kernel revision, 2.6.26-19-xen, yesterday, and it crashed one hour
later with the "blocked for more than 120 seconds" message.


I use Xen with a SAN backend (4 FC links / QLogic ISP2432 card) using
multipathd. It happens on Intel 54XX and 55XX processors, with or
without booting from SAN, and at random. SADC was the most common process
triggering this bug.


If I understand correctly, could this be related to bug #517449?

Regards.







Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads

2009-09-07 Thread Ben Hutchings
On Mon, 2009-09-07 at 09:25 +0200, Darck wrote:
>> This bug report is about systems that are *not* making use of
>> SCHED_IDLE, which are probably not due to the known scheduler bug.  This
>> particular error message typically indicates a lock ordering bug or
>> infinite delay while holding a lock, and can be caused by any kernel
>> component.
>>
>> One such case was fixed in version 2.6.26-16:
>>
>>  * Fix soft lockups caused by one md resync blocking on another due
>>    to sharing the same device (closes: #514627)
>>
>> Ben.
>
> Hi,
>
> Unfortunately, this fix did not solve the problem. I booted the
> latest kernel revision, 2.6.26-19-xen, yesterday, and it crashed one hour
> later with the "blocked for more than 120 seconds" message.

This is not exactly a crash, though I realise the effects are often just
as bad as a crash.

> I use Xen with a SAN backend (4 FC links / QLogic ISP2432 card) using
> multipathd. It happens on Intel 54XX and 55XX processors, with or
> without booting from SAN, and at random. SADC was the most common process
> triggering this bug.
>
> If I understand correctly, could this be related to bug #517449?

Could be.

Please send the kernel logs showing the "blocked for more than 120
seconds" messages and the function call traces that follow them.
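
For anyone collecting the requested logs, the following is a minimal sketch (not part of the original exchange) of how the hung-task messages and the lines after them could be pulled out of a kernel log. The function name, the `max_context` cut-off, and the sample lines are all assumptions made for illustration; the sample address and symbol are placeholders, not real trace output.

```python
import re

# Header line the kernel prints for a hung task, e.g.
# "INFO: task sadc:4242 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds"
)


def extract_hung_task_reports(lines, max_context=40):
    """Collect each hung-task header plus the lines that follow it.

    Returns a list of (task, pid, seconds, report_lines) tuples. At most
    `max_context` lines after each header are kept, which is a rough
    heuristic for capturing the call trace, not a real trace parser.
    """
    reports = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = HUNG_TASK_RE.search(line)
        if match:
            # A new report starts here; later lines are attached to it.
            current = [line]
            reports.append(
                (match.group(1), int(match.group(2)), int(match.group(3)), current)
            )
        elif current is not None and len(current) <= max_context:
            current.append(line)
    return reports


# Made-up sample input; the address and symbol are placeholders.
SAMPLE_LOG = [
    "[  100.000000] unrelated message",
    "[  240.000000] INFO: task sadc:4242 blocked for more than 120 seconds.",
    "[  240.000001] Call Trace:",
    "[  240.000002]  [<ffffffff00000000>] example_function+0x0/0x0",
]

for task, pid, seconds, report in extract_hung_task_reports(SAMPLE_LOG):
    print(f"{task} (pid {pid}) blocked for more than {seconds} s")
    print("\n".join(report))
```

Passing an open log file instead of the sample list should work the same way, for instance `extract_hung_task_reports(open("/var/log/kern.log"))`, assuming the kernel log is written to that path on the affected host.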

Ben.

-- 
Ben Hutchings
Life is what happens to you while you're busy making other plans.
   - John Lennon




Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads

2009-09-06 Thread Ben Hutchings
On Thu, 2009-09-03 at 10:14 +0200, Darck wrote:
> Hi,
>
> Still no news about this problem?
>
> While upgrading to a newer (unstable) kernel is easy with kernels
> provided by Debian, this is not the case for kernels with Xen support.
>
> The 2.6.26-2 Xen release is NOT usable in a production environment, and
> keeps crashing the host due to this scheduler bug.

This bug report is about systems that are *not* making use of
SCHED_IDLE, which are probably not due to the known scheduler bug.  This
particular error message typically indicates a lock ordering bug or
infinite delay while holding a lock, and can be caused by any kernel
component.

One such case was fixed in version 2.6.26-16:

  * Fix soft lockups caused by one md resync blocking on another due
    to sharing the same device (closes: #514627)
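
The "lock ordering bug" mentioned above can be illustrated with a small userspace analogy. This is only a sketch and is not kernel code or anything from the fix referenced in this report: two threads take the same two locks in opposite order, so each ends up waiting for a lock the other holds. The function name and the timeout are assumptions added so that the demonstration terminates; a kernel task in the same situation has no such timeout and stays blocked, which is what the hung-task message reports.

```python
import threading


def demonstrate_lock_order_inversion(timeout=0.2):
    """Run two threads that take the same two locks in opposite order.

    Returns a dict mapping thread name to whether it obtained its second
    lock. The timeout exists only so the demonstration terminates; a
    kernel task in the equivalent situation would simply stay blocked.
    """
    lock_a = threading.Lock()
    lock_b = threading.Lock()
    both_hold_first = threading.Barrier(2)
    both_gave_up = threading.Barrier(2)
    got_second = {}

    def worker(name, first, second):
        with first:
            # Wait until the other thread also holds its first lock.
            both_hold_first.wait()
            acquired = second.acquire(timeout=timeout)
            if acquired:
                second.release()
            got_second[name] = acquired
            # Keep holding `first` until the other thread has finished trying.
            both_gave_up.wait()

    threads = [
        threading.Thread(target=worker, args=("t1", lock_a, lock_b)),
        threading.Thread(target=worker, args=("t2", lock_b, lock_a)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return got_second


print(demonstrate_lock_order_inversion())
```

Neither thread is expected to obtain its second lock, so both entries in the returned dict should be `False`. Taking the locks in the same order in both threads removes the inversion.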

Ben.

-- 
Ben Hutchings
Life is what happens to you while you're busy making other plans.
   - John Lennon




Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads

2009-09-03 Thread Darck

Hi,

Still no news about this problem?

While upgrading to a newer (unstable) kernel is easy with kernels
provided by Debian, this is not the case for kernels with Xen support.


The 2.6.26-2 Xen release is NOT usable in a production environment, and
keeps crashing the host due to this scheduler bug.


Is there at least a known workaround, other than building one's own
Xen kernel or switching to another distro that has fixed this bug?


There are people ready to test patches to help fix this bug, but we
need the Debian team to really care about it...
At the moment, this is not the case :(


Regards.






Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads

2009-07-20 Thread Brendon Green

Hi Ben

Ben Hutchings ben-at-decadent.org.uk |DebianBug| wrote:

> On Fri, 2009-06-26 at 14:15 +1200, Brendon Green wrote:
>> I could rebuild the 2.6.26 host and guest kernels for the system in
>> question (I archive the .config files using a local version number
>> which, unfortunately, bears little or no relation to the kernel version
>> number).  However, I have also made recent changes to the hardware (PCI
>> PATA card to boost disk from ATA/33 to ATA/133).
>>
>> If you don't think the hardware changes will skew the results, I can
>> temporarily downgrade the kernels, apply the patch (to 2.6.26,
>> regrettably I no longer have Debian's 2.6.28 sources available), and try
>> to reproduce the problem.
>
> Please apply this patch to 2.6.26.  We want to fix the bug in lenny
> (2.6.26) and we don't care about .28 any more.  I don't think this is
> hardware-dependent, so don't bother changing your hardware back.
>
> Ben.


Today, I started playing around with unpatched .26 kernels as host and 
guest on my server.  Regrettably, I am yet to reproduce the bug, 
although the system *feels* a lot slower with .26 as host _and_ guest.  
I may yet try booting with the IDE in PIO mode, to artificially degrade 
CPU and disk performance, and see what happens.


On a different note, I applied your patch to my desktop machine (still 
running 2.6.26, despite being a mix of testing/unstable/experimental) a 
few weeks ago, in an attempt to improve OpenOffice.org performance.


Since doing that, I have been noticing X freezing (mouse moves, but no 
hot-tracking, and unable to switch tasks or VT's) when OOo, or another 
application, is busy.  Usually, when OOo is busy, I will switch to 
another task (or IceWeasel ;-) and continue working.


I'll give more feedback if and when I manage to tickle the bug on my server.

Cheers,
Brendon Green







Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads

2009-06-11 Thread John Morrissey
On Mon, Jun 08, 2009 at 07:14:44PM -0400, John Morrissey wrote:
> Thanks, Ben. Rebuilt with this patch and threw the resulting kernel on a
> couple of machines running several KVM VMs. I'll be able to provide
> confident feedback in a couple of days.

These machines have been stable since running kernels including this patch.

john
-- 
John Morrissey  _o/\   __o
j...@horde.net_- \_  /  \   \,
www.horde.net/__(_)/_(_)/\___(_) /_(_)__






Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads

2009-06-11 Thread dann frazier
On Thu, Jun 11, 2009 at 11:53:03AM -0400, John Morrissey wrote:
> On Mon, Jun 08, 2009 at 07:14:44PM -0400, John Morrissey wrote:
>> Thanks, Ben. Rebuilt with this patch and threw the resulting kernel on a
>> couple of machines running several KVM VMs. I'll be able to provide
>> confident feedback in a couple of days.
>
> These machines have been stable since running kernels including this patch.

Awesome, thanks for the feedback!

-- 
dann frazier







Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads

2009-06-08 Thread John Morrissey
On Fri, Jun 05, 2009 at 03:37:33AM +0100, Ben Hutchings wrote:
> Please try the patch I posted here:
>
> http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=54;bug=517449
>
> It includes a fix made between 2.6.26 and .28 that may address this bug.

Thanks, Ben. Rebuilt with this patch and threw the resulting kernel on a
couple of machines running several KVM VMs. I'll be able to provide
confident feedback in a couple of days.

john
-- 
John Morrissey  _o/\   __o
j...@horde.net_- \_  /  \   \,
www.horde.net/__(_)/_(_)/\___(_) /_(_)__






Bug#516374: INFO: task * blocked for more than 120 seconds. in numerous non-SCHED_IDLE workloads

2009-06-04 Thread Ben Hutchings
Please try the patch I posted here:

http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=54;bug=517449

It includes a fix made between 2.6.26 and .28 that may address this bug.

Ben.

-- 
Ben Hutchings
Logic doesn't apply to the real world. - Marvin Minsky

