Re: PROBLEM: All CPUs in soft lockup

2013-03-27 Thread Robert Norris
On Wed, Mar 27, 2013 at 12:55:41PM +1100, Robert Norris wrote:
> The console shows a new "BUG: soft lockup" line every few seconds

Looking closer, the whole thing starts with a _hard_ lockup.

  2013-03-26T08:33:39.921834-04:00 imap30 kernel: [185090.090328] Watchdog detected hard LOCKUP on cpu 3

(also in the logs of the other two servers I mentioned).
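
For anyone wanting to check their own logs for the same ordering, here is a rough sketch that reports the first lockup line per host. It assumes the syslog layout shown above and that lines are fed in chronological order; the function name and the second sample line are made up for illustration.

```python
import re

# Syslog layout seen in the excerpt above:
#   <ISO timestamp> <host> kernel: [<uptime>] <message>
LINE_RE = re.compile(
    r"^\s*(?P<ts>\S+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+(?P<msg>.*)$"
)
HARD_RE = re.compile(r"hard LOCKUP on cpu (\d+)")
SOFT_RE = re.compile(r"BUG: soft lockup")


def first_lockups(lines):
    """Return {host: (uptime, kind, message)} for the first lockup line per host.

    Assumes the lines arrive in chronological order.
    """
    first = {}
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue
        msg = m.group("msg")
        if HARD_RE.search(msg):
            kind = "hard"
        elif SOFT_RE.search(msg):
            kind = "soft"
        else:
            continue
        host = m.group("host")
        if host not in first:
            first[host] = (float(m.group("uptime")), kind, msg)
    return first


sample = [
    # Copied from the log excerpt above.
    "2013-03-26T08:33:39.921834-04:00 imap30 kernel: [185090.090328] "
    "Watchdog detected hard LOCKUP on cpu 3",
    # Hypothetical later line, made up for illustration only.
    "2013-03-26T08:34:05.000000-04:00 imap30 kernel: [185116.000000] "
    "BUG: soft lockup - CPU#1 stuck for 22s!",
]
result = first_lockups(sample)
print(result["imap30"][1])
```

In practice the list would be replaced by an open log file; the point is only to see whether a hard lockup precedes the soft lockup messages on each host.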

Looking down to where the watchdog interrupt comes in:

  2013-03-26T08:33:39.921870-04:00 imap30 kernel: [185090.090426] <<EOE>>  <IRQ>  [<8112a5b1>] ? end_buffer_async_read+0x79/0xff

Disassembling:

  0x8112a57a <+66>:    mov    %rbx,%rdi
  0x8112a57d <+69>:    callq  0x81129265 <buffer_io_error>
  0x8112a582 <+74>:    lock orb $0x2,0x0(%rbp)
  0x8112a587 <+79>:    mov    0x0(%rbp),%rax
  0x8112a58b <+83>:    test   $0x8,%ah
  0x8112a58e <+86>:    jne    0x8112a594 <end_buffer_async_read+92>
  0x8112a590 <+88>:    ud2
  0x8112a592 <+90>:    jmp    0x8112a592 <end_buffer_async_read+90>
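
A small sketch for cross-checking the offsets, in case it helps anyone following along. It assumes the usual conventions: gdb prints the <+N> offsets in decimal, while the +0x79/0xff in the trace is hex. The variable names are mine; the numbers are copied from the trace and the listing.

```python
# Offsets (decimal, as gdb prints them) mapped to the listed addresses.
LISTING = {
    66: 0x8112A57A,
    69: 0x8112A57D,
    74: 0x8112A582,
    79: 0x8112A587,
    83: 0x8112A58B,
    86: 0x8112A58E,
    88: 0x8112A590,
    90: 0x8112A592,
}
TRACE_OFFSET = 0x79  # from "end_buffer_async_read+0x79/0xff" (hex)

# Every listed line should imply the same function start address.
bases = {addr - off for off, addr in LISTING.items()}
assert len(bases) == 1, "listing offsets are inconsistent"
base = next(iter(bases))

trace_addr = base + TRACE_OFFSET
print(hex(base))                # start of end_buffer_async_read
print(TRACE_OFFSET)             # the trace offset in decimal
print(hex(trace_addr))          # address the trace entry refers to
print(TRACE_OFFSET in LISTING)  # is that offset inside the excerpt?
```

Worth keeping in mind when reading the two side by side: +0x79 in the trace is +121 in gdb's numbering, and a "?" in front of a trace entry marks a frame the unwinder is not sure about.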


That lock at +74 is presumably the offender here, which is line 275 of
fs/buffer.c:

  275 SetPageError(page);
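
To sanity-check that the instructions really match that source line, here is a sketch that maps the immediates onto page flag names. The flag order is an assumption on my part (the leading entries of enum pageflags in include/linux/page-flags.h as I understand them for a 3.4-era tree) and should be verified against the tree actually built; the helper name is made up.

```python
# ASSUMPTION: leading entries of enum pageflags for a 3.4-era kernel.
# Verify against include/linux/page-flags.h in the tree that was built.
PAGEFLAGS = [
    "PG_locked", "PG_error", "PG_referenced", "PG_uptodate",
    "PG_dirty", "PG_lru", "PG_active", "PG_slab",
    "PG_owner_priv_1", "PG_arch_1", "PG_reserved", "PG_private",
]


def flags_in_mask(mask, byte_index=0):
    """Name the flags selected by an 8-bit immediate applied to one byte
    of page->flags (byte 0 is the low byte on little-endian x86)."""
    names = []
    for bit in range(8):
        if mask & (1 << bit):
            pos = byte_index * 8 + bit
            names.append(PAGEFLAGS[pos] if pos < len(PAGEFLAGS) else f"bit {pos}")
    return names


# lock orb $0x2,0x0(%rbp): OR into the low byte of the flags word.
print(flags_in_mask(0x2))
# test $0x8,%ah: %ah is bits 8..15 of %rax, so this tests bit 11.
print(flags_in_mask(0x8, byte_index=1))
```

Under that assumption the orb sets PG_error, which fits SetPageError(), since an atomic set_bit with a constant bit number compiles to a lock-prefixed OR on x86. The test and ud2 that follow would then look like a BUG_ON(!PagePrivate(page)) check, though I have not confirmed that against the source.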

So another CPU has these page flags locked right now, and isn't keen to
release that lock?

I don't know how to debug this further. What's the next step?

Thanks,
Rob.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PROBLEM: All CPUs in soft lockup

2013-03-27 Thread Robert Norris
On Wed, Mar 27, 2013, at 02:42 PM, li guang wrote:
> It seems tasks are hogging your CPU/memory resources. Did you check the
> status of your servicing processes?

According to my monitoring I have plenty of CPU and memory free at the
time the problem occurs. What specifically are you looking at in the data I
provided that makes you think that?


Re: PROBLEM: All CPUs in soft lockup

2013-03-26 Thread li guang
It seems tasks are hogging your CPU/memory resources.
Did you check the status of your servicing processes?

On Wed, 2013-03-27 at 12:55 +1100, Robert Norris wrote:
> In the last two weeks we've had three servers (identical hardware,
> software and load) hang. The details in this report are from one that
> hung last night.
> 
> They're all IMAP servers servicing many hundreds of users, so several
> thousand processes and active connections. There have been two major
> application-level changes in the last couple of weeks, corresponding to
> the time when these hangs started. One is that we now do mail event
> notifications directly to user clients, so more TCP connections. The
> other is that we're now maintaining live search indexes, so a lot more
> disk and tmpfs IO.
> 
> All that said, we're not under what we'd consider to be heavy load. When
> they're running, the servers are fast and responsive.
> 
> During the hang itself, the machine responds to pings, and TCP
> connections can be established, but the servicing processes never
> respond. The console shows a new "BUG: soft lockup" line every few
> seconds, and will not respond to keyboard input. It is a virtual console
> though, which may or may not make a difference, I'm not sure.
> 
> The kernel is 3.4.33 with AUFS patches applied. However there are no
> AUFS mounts on this machine; we use this elsewhere. If you think that's
> a problem I can rebuild for this machine without it.
> 
> Attached are various bits of information requested in REPORTING-BUGS.
> I'm not entirely sure what else is relevant. I'm happy to supply any
> other information and test things, just let me know.
> 
> Thanks,
> Rob.



