Re: PROBLEM: All CPUs in soft lockup
On Wed, Mar 27, 2013 at 12:55:41PM +1100, Robert Norris wrote:
> The console shows a new "BUG: soft lockup" line every few seconds

Looking closer, the whole thing starts with a _hard_ lockup.

2013-03-26T08:33:39.921834-04:00 imap30 kernel: [185090.090328] Watchdog detected hard LOCKUP on cpu 3

(also in the logs of the other two servers I mentioned).

Looking down to where the watchdog interrupt comes in:

2013-03-26T08:33:39.921870-04:00 imap30 kernel: [185090.090426] <<EOE>> <IRQ> [<8112a5b1>] ? end_buffer_async_read+0x79/0xff

Disassembling:

   0x8112a57a <+66>:    mov    %rbx,%rdi
   0x8112a57d <+69>:    callq  0x81129265 <buffer_io_error>
   0x8112a582 <+74>:    lock orb $0x2,0x0(%rbp)
   0x8112a587 <+79>:    mov    0x0(%rbp),%rax
   0x8112a58b <+83>:    test   $0x8,%ah
   0x8112a58e <+86>:    jne    0x8112a594 <end_buffer_async_read+92>
   0x8112a590 <+88>:    ud2
   0x8112a592 <+90>:    jmp    0x8112a592 <end_buffer_async_read+90>

That lock at +74 is presumably the offender here. Which is line 275 of fs/buffer.c:

    275         SetPageError(page);

So another CPU has these page flags locked right now, and isn't keen to release that lock?

I don't know how to debug this further. What's the next step?

Thanks,
Rob.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
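As a cross-check on the offsets quoted in the message above, here is a minimal Python sketch of the address arithmetic. It uses only the numbers from the trace line and the disassembly; the variable names are made up for illustration. It derives the start of end_buffer_async_read from the "lock orb" line, then works out where the +0x79 offset from the trace entry lands relative to the disassembled range. It shows only where the addresses fall, not which instruction is actually at fault.

```python
# Numbers copied from the quoted trace and disassembly; names are illustrative.
LOCK_ORB_ADDR = 0x8112A582    # "lock orb $0x2,0x0(%rbp)", listed as <+74>
LOCK_ORB_OFFSET = 74
TRACE_OFFSET = 0x79           # from "end_buffer_async_read+0x79/0xff"
LAST_DISASM_OFFSET = 90       # last instruction shown in the excerpt, <+90>

# Start of end_buffer_async_read, derived from one disassembly line.
func_base = LOCK_ORB_ADDR - LOCK_ORB_OFFSET

# Absolute address that the trace entry's offset corresponds to.
traced_addr = func_base + TRACE_OFFSET

# Distance from the lock orb instruction to the traced address.
gap_after_lock_orb = traced_addr - LOCK_ORB_ADDR

print("function base:          ", hex(func_base))
print("traced address:         ", hex(traced_addr))
print("trace offset (decimal): ", TRACE_OFFSET)
print("bytes past the lock orb:", gap_after_lock_orb)
print("inside quoted excerpt:  ", TRACE_OFFSET <= LAST_DISASM_OFFSET)
```

Running this kind of check before reading too much into a single instruction helps confirm that the disassembled window actually covers the offset named in the trace.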
Re: PROBLEM: All CPUs in soft lockup
On Wed, Mar 27, 2013, at 02:42 PM, li guang wrote:
> seems tasks are hogging your cpu/memory resource, did you check status
> your servicing processes?

According to my monitoring I have plenty of CPU and memory free at the time the problem occurs. What specifically in the data I provided makes you think that?
Re: PROBLEM: All CPUs in soft lockup
seems tasks are hogging your cpu/memory resource, did you check status your servicing processes?

On Wed, 2013-03-27 at 12:55 +1100, Robert Norris wrote:
> In the last two weeks we've had three servers (identical hardware,
> software and load) hang. The details in this report are from one that
> hung last night.
>
> They're all IMAP servers servicing many hundreds of users, so several
> thousand processes and active connections. There's been two major
> application level changes in the last couple of weeks, corresponding to
> the time where these hangs started. One is that we now do mail event
> notifications directly to user clients, so more TCP connections. The
> other is that we're now maintaining live search indexes, so a lot more
> disk and tmpfs IO.
>
> All that said, we're not under what we'd consider to be heavy load. When
> they're running, the servers are fast and responsive.
>
> During the hang itself, the machine responds to pings, and TCP
> connections can be established, but the servicing processes never
> respond. The console shows a new "BUG: soft lockup" line every few
> seconds, and will not respond to keyboard input. It is a virtual console
> though, which may or may not make a difference, I'm not sure.
>
> The kernel is 3.4.33 with AUFS patches applied. However there are no
> AUFS mounts on this machine; we use this elsewhere. If you think that's
> a problem I can rebuild for this machine without it.
>
> Attached are various bits of information requested in REPORTING-BUGS.
> I'm not entirely sure what else is relevant. I'm happy to supply any
> other information and test things, just let me know.
>
> Thanks,
> Rob.