RE: Sytem slowdown on 2.4.1-ac20 (recurring from 2.4.0)
Just FYI: I remember posting something a few days ago to make the serial console more reliable for such situations. Some allocations in the serial port driver are done at runtime using page_alloc; if somebody runs out of memory, the serial tty driver would not work properly. I am not saying that you ran out of memory. All I am saying is that it is possible to make the serial tty driver more reliable using boot-time initialization. Please excuse me if you find this a little off-topic.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Vibol Hou
Sent: Monday, February 26, 2001 4:25 PM
To: Linux-Kernel
Subject: Sytem slowdown on 2.4.1-ac20 (recurring from 2.4.0)

I reported this problem a long while ago, but no one answered my pleas. To tell you the honest truth, I don't know where to begin looking. It's difficult to poke around when the serial console is unresponsive :/

When I was running 2.4.0, the system, a dual-processor webserver, would _completely_ slow down after about 3 days of constant uptime (and a few million pages served). I mean _SLOW_. I could get commands executed, but it would take an unholy long time to type the commands in. It seemed the server was dropping lots of packets. All TCP services simply stopped or slowed. ICMP packet loss to the server would be a sporadic 50% to 75%. Web service was rendered useless. SSH _barely_ worked. The few commands I could run (w, free, memstat, top) showed nothing out of the ordinary. Back then, I didn't have a serial console set up.

Now I'm running 2.4.1-ac20 and I set up a serial console to try to catch any errors. I was hoping the problem wouldn't recur with this newer kernel, but it still happens, now at about 5 days of uptime. When I manage to get in a 'shutdown -h now' through SSH, the serial console spits out:

  INIT: Switching to runlevel: 0
  INIT:

And that's it. It doesn't even seem to be able to finish shutting down.

Thus far, no one else has reported any problems similar to mine, so it makes me wonder what is wrong. The system ran fine with an uptime of over 100 days with the old 2.2.17 kernel. What stumps me is the fact that the serial console is completely unresponsive to input when the server gets into this state.

Having said that, does anyone have any ideas or pointers for me? Again, this may seem like a fairly undescriptive e-mail, but that's just because I can't do anything on the server when it gets to this state. If there is anything you recommend I do when this happens again (other than restart the system), please let me know and I'll try.

--
Vibol Hou
KhmerConnection, http://khmer.cc

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Sytem slowdown on 2.4.1-ac20 (recurring from 2.4.0)
Vibol Hou wrote:
>
> > Are you still getting the "hordes" of Tx timeouts with the
> > 3c905B which you reported a week ago?
>
> Yes, but they happen a few hours after the system starts up and continue
> until the server is restarted. It seems like a separate issue. I haven't
> tried taking down the interface and putting it back up since the interface
> never dies like others have reported with their NICs. The NIC continues to
> work fine, though the logs get flooded with those messages.
>
> > If so, do they only start coming out when the slowdown occurs?
>
> That's a negative.
>
> > You are probably a victim of the APIC bug. A
> > workaround for this is present in 2.4.2-ac5. Alternatively,
> > boot the kernel with the `noapic' LILO option.
>
> I'll compile 2.4.2-ac5 and we'll see in another 5 days if this happens
> again. Till then, any suggestions on what to look for/at and/or what to do
> when it happens will help.

OK. The 'Interrupt posted but not delivered' message means that the Ethernet controller thinks that it is driving the physical interrupt line, but the CPUs aren't being interrupted.

Check /proc/interrupts and see if the NIC's IRQ is shared with something else. If it isn't, or if it is shared with something reputable, then given that the machine works OK with 2.2 kernels, it's probably the APIC.

But it's unusual that the system "continues to work fine". Usually, a busted APIC slows networking to a crawl. We generate an artificial interrupt once per 400 milliseconds via the Tx timeout handler. This can process 16 outgoing packets and 32 incoming packets. This `polled mode' is present in many Linux network drivers - it's there so you can still telnet into the machine and whack it when it's being silly.

-
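The two checks described in the reply above can be sketched in Python. This is a minimal illustration and not part of the original thread. It assumes the 2.4-era /proc/interrupts layout (IRQ number, one counter per CPU, controller type, then a comma-separated device list), and the sample text, device names, and function names are hypothetical. The second function only works out the arithmetic implied by "once per 400 milliseconds" with 16 outgoing and 32 incoming packets per poll.

```python
# Hypothetical /proc/interrupts contents for a dual-CPU box (illustration only).
SAMPLE = """\
           CPU0       CPU1
  0:    4512345    4498765    IO-APIC-edge  timer
  4:       1203       1187    IO-APIC-edge  serial
 19:     987654     991234   IO-APIC-level  eth0, aic7xxx
NMI:          0          0
"""


def parse_interrupts(text):
    """Map each numeric IRQ to the list of device names registered on it."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())            # header row: CPU0 CPU1 ...
    irqs = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():      # skip NMI, LOC, ERR and similar rows
            continue
        fields = rest.split()
        # Skip the per-CPU counters and the controller-type column;
        # what remains is the comma-separated device list.
        tail = " ".join(fields[ncpus + 1:])
        irqs[int(label)] = [d.strip() for d in tail.split(",") if d.strip()]
    return irqs


def shared_with(irqs, device):
    """Return the other devices on the same IRQ as `device`, or None if absent."""
    for devices in irqs.values():
        if device in devices:
            return [d for d in devices if d != device]
    return None


def polled_ceiling(period_ms=400, tx_per_poll=16, rx_per_poll=32):
    """Upper bound on packets per second when only the periodic poll runs."""
    return tx_per_poll * 1000 / period_ms, rx_per_poll * 1000 / period_ms


if __name__ == "__main__":
    print("eth0 shares its IRQ with:", shared_with(parse_interrupts(SAMPLE), "eth0"))
    print("polled-mode ceiling (tx, rx) packets/s:", polled_ceiling())
```

On a live machine the text would come from reading /proc/interrupts rather than from SAMPLE. The ceiling of roughly 40 outgoing and 80 incoming packets per second shows why a NIC running purely in polled mode would be expected to crawl, which is why a system that "continues to work fine" is surprising.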
RE: Sytem slowdown on 2.4.1-ac20 (recurring from 2.4.0)
> Are you still getting the "hordes" of Tx timeouts with the
> 3c905B which you reported a week ago?

Yes, but they happen a few hours after the system starts up and continue until the server is restarted. It seems like a separate issue. I haven't tried taking down the interface and putting it back up since the interface never dies like others have reported with their NICs. The NIC continues to work fine, though the logs get flooded with those messages.

> If so, do they only start coming out when the slowdown occurs?

That's a negative.

> You are probably a victim of the APIC bug. A
> workaround for this is present in 2.4.2-ac5. Alternatively,
> boot the kernel with the `noapic' LILO option.

I'll compile 2.4.2-ac5 and we'll see in another 5 days if this happens again. Till then, any suggestions on what to look for/at and/or what to do when it happens will help.

Thanks!

--
Vibol Hou
KhmerConnection, http://khmer.cc

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Andrew Morton
Sent: Monday, February 26, 2001 4:44 PM
To: Vibol Hou
Cc: Linux-Kernel
Subject: Re: Sytem slowdown on 2.4.1-ac20 (recurring from 2.4.0)

Vibol Hou wrote:
>
> I've reported this problem a long while ago, but no one answered my pleas.
> To tell you the honest truth, I don't know where to begin looking. It's
> difficult to poke around when the serial console is unresponsive :/
>

Sounds like a network driver problem.

Are you still getting the "hordes" of Tx timeouts with the 3c905B which you reported a week ago?

If so, do they only start coming out when the slowdown occurs?

You are probably a victim of the APIC bug. A workaround for this is present in 2.4.2-ac5. Alternatively, boot the kernel with the `noapic' LILO option.

Please let us know the outcome.
Re: Sytem slowdown on 2.4.1-ac20 (recurring from 2.4.0)
Vibol Hou wrote:
>
> I've reported this problem a long while ago, but no one answered my pleas.
> To tell you the honest truth, I don't know where to begin looking. It's
> difficult to poke around when the serial console is unresponsive :/
>

Sounds like a network driver problem.

Are you still getting the "hordes" of Tx timeouts with the 3c905B which you reported a week ago?

If so, do they only start coming out when the slowdown occurs?

You are probably a victim of the APIC bug. A workaround for this is present in 2.4.2-ac5. Alternatively, boot the kernel with the `noapic' LILO option.

Please let us know the outcome.

-
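After rebooting with the `noapic' option suggested above, it may be worth confirming that the kernel actually received it. A minimal Python sketch, not from the original thread: it assumes the running kernel exposes its boot command line in /proc/cmdline, and the sample command line and function names are hypothetical.

```python
# Hypothetical kernel command line, for illustration only.
SAMPLE_CMDLINE = "auto BOOT_IMAGE=linux ro root=/dev/sda1 noapic console=ttyS0,9600"


def has_boot_option(cmdline, option):
    """Return True if `option` appears as a whole word on the command line."""
    return option in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the boot command line of the running kernel (assumes Linux procfs)."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    # Uses the sample text; on a live system pass read_cmdline() instead.
    print("noapic active:", has_boot_option(SAMPLE_CMDLINE, "noapic"))
```

Matching on whole words avoids a false positive from a different option that merely contains the same substring.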