Re: [PATCH] raise MAX_SERVER_LIMIT
On Tue, Jan 27, 2004 at 02:24:46PM -0500, Jeff Trawick wrote:
> I'm testing with this patch currently (so far so good):

Same here, I've applied the patch, and right now have 1 hour's uptime,
which is 12 times more than I've ever had with worker before. Looks
like that was it. Where do I send the beer?

Oh, and if someone is committing to worker.c, do me a favour and add a
zero to MAX_SERVER_LIMIT there too ;)

Index: server/mpm/worker/worker.c
===================================================================
RCS file: /home/cvs/httpd-2.0/server/mpm/worker/worker.c,v
retrieving revision 1.145
diff -u -r1.145 worker.c
--- server/mpm/worker/worker.c  27 Jan 2004 15:19:58 -0000  1.145
+++ server/mpm/worker/worker.c  27 Jan 2004 19:20:10 -0000
@@ -1441,7 +1441,8 @@
                 ++idle_thread_count;
             }
         }
-        if (any_dead_threads && totally_free_length < idle_spawn_rate
+        if (any_dead_threads && totally_free_length < idle_spawn_rate
+            && free_length < MAX_SPAWN_RATE
             && (!ps->pid               /* no process in the slot */
                 || ps->quiescing)) {   /* or at least one is going away */
             if (all_dead_threads) {

-- 
Colm MacCárthaigh                        Public Key: [EMAIL PROTECTED]
Re: [PATCH] raise MAX_SERVER_LIMIT
On Wed, Jan 28, 2004 at 10:40:54AM +0000, Colm MacCarthaigh wrote:
> On Tue, Jan 27, 2004 at 02:24:46PM -0500, Jeff Trawick wrote:
> > I'm testing with this patch currently (so far so good):
>
> Same here, I've applied the patch, and right now have 1 hour's uptime,
> which is 12 times more than I've ever had with worker before. Looks
> like that was it. Where do I send the beer?

Still going well, no problems since the patch, but after all that, I
can finally benchmark worker vs prefork properly:

Prefork:

Server Software:        Apache/2.0.48
Server Hostname:        ftp.heanet.ie
Server Port:            80

Document Path:          /
Document Length:        2841 bytes

Concurrency Level:      40
Time taken for tests:   31.223 seconds
Complete requests:      10000
Failed requests:        0
Broken pipe errors:     0
Total transferred:      29910000 bytes
HTML transferred:       28410000 bytes
Requests per second:    320.28 [#/sec] (mean)
Time per request:       124.89 [ms] (mean)
Time per request:       3.12 [ms] (mean, across all concurrent requests)
Transfer rate:          957.95 [Kbytes/sec] received

Connnection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0     0    1.5      0    51
Processing:    17   124  192.2     54  1692
Waiting:       14   124  192.2     54  1691
Total:         17   124  192.3     55  1693

Percentage of the requests served within a certain time (ms)
  50%     55
  66%     67
  75%     84
  80%    108
  90%    337
  95%    547
  98%    706
  99%    861
 100%   1693 (last request)

Worker:

Server Software:        Apache/2.0.48
Server Hostname:        ftp.heanet.ie
Server Port:            80

Document Path:          /
Document Length:        2841 bytes

Concurrency Level:      40
Time taken for tests:   39.907 seconds
Complete requests:      10000
Failed requests:        0
Broken pipe errors:     0
Total transferred:      29910000 bytes
HTML transferred:       28410000 bytes
Requests per second:    250.58 [#/sec] (mean)
Time per request:       159.63 [ms] (mean)
Time per request:       3.99 [ms] (mean, across all concurrent requests)
Transfer rate:          749.49 [Kbytes/sec] received

Connnection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0     0    0.5      0    17
Processing:     1   159  187.8     90  1540
Waiting:        1   159  187.8     90  1540
Total:          1   159  187.8     90  1540

Percentage of the requests served within a certain time (ms)
  50%     90
  66%    124
  75%    176
  80%    206
  90%    345
  95%    590
  98%    865
  99%    968
 100%   1540 (last request)

Oh well :)

-- 
Colm MacCárthaigh                        Public Key: [EMAIL PROTECTED]
Re: [PATCH] raise MAX_SERVER_LIMIT
Colm MacCarthaigh wrote:
> On Tue, Jan 27, 2004 at 02:24:46PM -0500, Jeff Trawick wrote:
> > I'm testing with this patch currently (so far so good):
>
> Same here, I've applied the patch, and right now have 1 hour's uptime,
> which is 12 times more than I've ever had with worker before. Looks
> like that was it. Where do I send the beer?

cool!

hold off on the beer for now... I couldn't go through a keg before it
went flat, and the cans and bottles don't really do it for me, draft
device or not...

> Oh, and if someone is committing to worker.c, do me a favour and add a
> zero to MAX_SERVER_LIMIT there too ;)

yup; MAX_THREAD_LIMIT should be increased too, as I have come close
enough to 20,000 ThreadsPerChild on AIX before to wonder if I
could/would hit it...

I'll make a note to go through the Unix MPMs soon-ish and give some
more breathing room.
Re: [PATCH] raise MAX_SERVER_LIMIT
worker MPM stack corruption in parent:

    int free_slots[MAX_SPAWN_RATE];
...
            /* great! we prefer these, because the new process can
             * start more threads sooner.  So prioritize this slot
             * by putting it ahead of any slots with active threads.
             *
             * first, make room by moving a slot that's potentially
             * still in use to the end of the array
             */
NEW CODE->  ap_assert(free_length < MAX_SPAWN_RATE);
            free_slots[free_length] = free_slots[totally_free_length];

[Tue Jan 27 12:20:19 2004] [crit] file worker.c, line 1590, assertion
"free_length < MAX_SPAWN_RATE" failed
Re: [PATCH] raise MAX_SERVER_LIMIT
Jeff Trawick wrote:
> worker MPM stack corruption in parent:
>
>     int free_slots[MAX_SPAWN_RATE];
> ...
>             /* great! we prefer these, because the new process can
>              * start more threads sooner.  So prioritize this slot
>              * by putting it ahead of any slots with active threads.
>              *
>              * first, make room by moving a slot that's potentially
>              * still in use to the end of the array
>              */
> NEW CODE->  ap_assert(free_length < MAX_SPAWN_RATE);
>             free_slots[free_length] = free_slots[totally_free_length];
>
> [Tue Jan 27 12:20:19 2004] [crit] file worker.c, line 1590, assertion
> "free_length < MAX_SPAWN_RATE" failed

I'm testing with this patch currently (so far so good):

Index: server/mpm/worker/worker.c
===================================================================
RCS file: /home/cvs/httpd-2.0/server/mpm/worker/worker.c,v
retrieving revision 1.145
diff -u -r1.145 worker.c
--- server/mpm/worker/worker.c  27 Jan 2004 15:19:58 -0000  1.145
+++ server/mpm/worker/worker.c  27 Jan 2004 19:20:10 -0000
@@ -1441,7 +1441,8 @@
                 ++idle_thread_count;
             }
         }
-        if (any_dead_threads && totally_free_length < idle_spawn_rate
+        if (any_dead_threads && totally_free_length < idle_spawn_rate
+            && free_length < MAX_SPAWN_RATE
             && (!ps->pid               /* no process in the slot */
                 || ps->quiescing)) {   /* or at least one is going away */
             if (all_dead_threads) {
Re: [PATCH] raise MAX_SERVER_LIMIT
Colm MacCarthaigh wrote:
> On Mon, Jan 26, 2004 at 06:28:03PM +0000, Colm MacCarthaigh wrote:
> > > I'd love to find out what's causing your worker failures. Are you
> > > using any thread-unsafe modules or libraries?
> >
> > Not to my knowledge. I wasn't planning to do this till later, but
> > I've bumped to 2.1; I'll try out the forensic_id and backtrace
> > modules right now, and see how that goes.
>
> *sigh*, forensic_id didn't catch it, backtrace didn't catch it,
> whatkilledus didn't catch it, all tried individually.
>
> The parent just dumps core; the children live on, serve their content
> and log their request, and then drop off one by one. No incomplete
> requests, no backtrace or other exception info thrown into any log.
> The corefile is as useful as ever, un-backtraceable. Suggestions
> welcome!

mod_log_forensic?

-- 
http://www.apache-ssl.org/ben.html       http://www.thebunker.net/

There is no limit to what a man can do or how far he can go if he
doesn't mind who gets the credit. - Robert Woodruff
Re: [PATCH] raise MAX_SERVER_LIMIT
On Thu, Jan 15, 2004 at 04:04:38PM +0000, Colm MacCarthaigh wrote:
> There were other changes coincidental to that, like going to 12Gb of
> RAM, which certainly helped, so it's hard to narrow it down too much.

Ok with 18,000 or so child processes (all in the run queue) what does
your load look like? Also, what kind of memory footprint are you
seeing?

> I don't use worker because it still dumps an un-backtraceable corefile
> within about 5 minutes for me. I still have no idea why, though plenty
> of corefiles. I haven't tried a serious analysis yet, because I've
> been moving house, but I hope to get to it soon. Moving to worker
> would be a good thing :)

I'd love to find out what's causing your worker failures. Are you using
any thread-unsafe modules or libraries?

-aaron
Re: [PATCH] raise MAX_SERVER_LIMIT
On Mon, Jan 26, 2004 at 10:09:20AM -0800, Aaron Bannert wrote:
> On Thu, Jan 15, 2004 at 04:04:38PM +0000, Colm MacCarthaigh wrote:
> > There were other changes coincidental to that, like going to 12Gb of
> > RAM, which certainly helped, so it's hard to narrow it down too
> > much.
>
> Ok with 18,000 or so child processes (all in the run queue) what does
> your load look like? Also, what kind of memory footprint are you
> seeing?

At the time, we were seeing a load of between 8 and 15, varying like a
sawtooth waveform. It would climb and climb, there'd be a steady sharp
decrease, and the cycle would start again.

At one point I miscompiled the Linux pre-empt options into the kernel,
and that made things very interesting. Load was much more radical in
its mood-swings then. There would be points when it would slow to a
crawl, and the amount of data we shipped was down - we only managed to
peak at 200Mbit/sec during the heaviest part of it. Our daily peak is
about 380Mbit, but hopefully we'll be more ready next time. I've
managed to commission the second server, and move the updates to it;
see:

	http://ftp.heanet.ie/about/

for an idea of the architecture.

As for memory footprint, it wasn't too bad. I actually put the system
into 4Gb mode to avoid bounce-buffering - something I hadn't fully
mapped out yet. We were using all of the RAM, but that's not unusual
for us; we aggressively cache as much of the filesystem as XFS lets us.
All of the Apache instances added up to about 165Mb of RAM.

> > I don't use worker because it still dumps an un-backtraceable
> > corefile within about 5 minutes for me. I still have no idea why,
> > though plenty of corefiles. I haven't tried a serious analysis yet,
> > because I've been moving house, but I hope to get to it soon. Moving
> > to worker would be a good thing :)
>
> I'd love to find out what's causing your worker failures. Are you
> using any thread-unsafe modules or libraries?

Not to my knowledge. I wasn't planning to do this till later, but I've
bumped to 2.1; I'll try out the forensic_id and backtrace modules right
now, and see how that goes.

-- 
Colm MacCárthaigh                        Public Key: [EMAIL PROTECTED]
Re: [PATCH] raise MAX_SERVER_LIMIT
On Mon, Jan 26, 2004 at 06:28:03PM +0000, Colm MacCarthaigh wrote:
> > I'd love to find out what's causing your worker failures. Are you
> > using any thread-unsafe modules or libraries?
>
> Not to my knowledge. I wasn't planning to do this till later, but I've
> bumped to 2.1; I'll try out the forensic_id and backtrace modules
> right now, and see how that goes.

*sigh*, forensic_id didn't catch it, backtrace didn't catch it,
whatkilledus didn't catch it, all tried individually.

The parent just dumps core; the children live on, serve their content
and log their request, and then drop off one by one. No incomplete
requests, no backtrace or other exception info thrown into any log.
The corefile is as useful as ever, un-backtraceable. Suggestions
welcome!

-- 
Colm MacCárthaigh                        Public Key: [EMAIL PROTECTED]
Re: [PATCH] raise MAX_SERVER_LIMIT
On Mon, Jan 26, 2004 at 07:37:23PM +0000, Colm MacCarthaigh wrote:
> On Mon, Jan 26, 2004 at 06:28:03PM +0000, Colm MacCarthaigh wrote:
> > > I'd love to find out what's causing your worker failures. Are you
> > > using any thread-unsafe modules or libraries?
> >
> > Not to my knowledge. I wasn't planning to do this till later, but
> > I've bumped to 2.1; I'll try out the forensic_id and backtrace
> > modules right now, and see how that goes.
>
> *sigh*, forensic_id didn't catch it, backtrace didn't catch it,
> whatkilledus didn't catch it, all tried individually.
>
> The parent just dumps core; the children live on, serve their content
> and log their request, and then drop off one by one. No incomplete
> requests, no backtrace or other exception info thrown into any log.
> The corefile is as useful as ever, un-backtraceable. Suggestions
> welcome!

Have you tried setting up a signal handler for SIGSEGV and calling
kill(getpid(), SIGSTOP); in the signal handler? After attaching to the
process with gdb, send a CONT signal to the process from another
terminal. It's worth a shot.

(Is the process dying from SIGSEGV or some other signal? Does the core
file tell you?)

Can you get a tcpdump of the traffic leading up to the crash? (Yeah, I
know it would be a lot.) If you can get a tcpdump, and then can replay
the traffic and reproduce it, more of us can look at this.

Cheers,
Glenn
Re: [PATCH] raise MAX_SERVER_LIMIT
Colm MacCarthaigh wrote:
> On Mon, Jan 26, 2004 at 06:28:03PM +0000, Colm MacCarthaigh wrote:
> > > I'd love to find out what's causing your worker failures. Are you
> > > using any thread-unsafe modules or libraries?
> >
> > Not to my knowledge. I wasn't planning to do this till later, but
> > I've bumped to 2.1; I'll try out the forensic_id and backtrace
> > modules right now, and see how that goes.
>
> *sigh*, forensic_id didn't catch it,

forensic_id is just for crash in child

> backtrace didn't catch it, whatkilledus didn't catch it, all tried
> individually.

disable the check for geteuid()==0 and see if you get backtrace?
exception hook purposefully doesn't run as root (I assume your parent
is running as root)
Re: [PATCH] raise MAX_SERVER_LIMIT
On Mon, Jan 26, 2004 at 04:25:58PM -0500, Jeff Trawick wrote:
> > *sigh*, forensic_id didn't catch it,
>
> forensic_id is just for crash in child

I know, but I couldn't rule out a crash in the child being a root
cause... until now. It doesn't look like it's triggered by a particular
URI anyway.

> > backtrace didn't catch it, whatkilledus didn't catch it, all tried
> > individually.
>
> disable the check for geteuid()==0 and see if you get backtrace?
> exception hook purposefully doesn't run as root (I assume your parent
> is running as root)

No problem, first thing tomorrow :)

-- 
Colm MacCárthaigh                        Public Key: [EMAIL PROTECTED]
Re: [PATCH] raise MAX_SERVER_LIMIT
Colm MacCarthaigh wrote:
> Not entirely serious, but today, we actually hit this, in production
> :) The hardware, a dual 2Ghz Xeon with 12Gb RAM with Linux 2.6.1-rc2,
> coped and remained responsive. So 20,000 may no longer be outside the
> realms of what administrators reasonably desire to have.
>
> -#define MAX_SERVER_LIMIT 20000
> +#define MAX_SERVER_LIMIT 100000

dang!

Committed a limit of 200000. A couple of observations:

* I don't think you could do this with an early 2.4 kernel on i386
  because of eating up kernel memory with LDTs, assuming APR thinks it
  can support threads. Not sure about current kernels from popular
  distros.

* Should I assume you tried worker but it uses too much CPU? If so, is
  it a small percentage more or really really bad?

Thanks,
Greg
Re: [PATCH] raise MAX_SERVER_LIMIT
On Thu, Jan 15, 2004 at 10:49:43AM -0500, [EMAIL PROTECTED] wrote:
> > -#define MAX_SERVER_LIMIT 20000
> > +#define MAX_SERVER_LIMIT 100000
>
> dang!
>
> Committed a limit of 200000. A couple of observations:
>
> * I don't think you could do this with an early 2.4 kernel on i386
>   because of eating up kernel memory with LDTs, assuming APR thinks it
>   can support threads. Not sure about current kernels from popular
>   distros.

I'm running 2.6.1-mm2 now, and things are much much better. We got away
with it - just about - with 2.6.1 vanilla, but -mm2 has improved the
stability a lot. Peaked at just over 18,000 with 2.6.1-mm2 so far and
it was a lot more bearable.

We were having major stability problems with 2.4, and as soon as we
went to 2.6 we found out why - we were getting a lot more client
requests than we thought, but they were queuing. We hit 20,000 within 2
days of going to 2.6.1-rc2. There were other changes coincidental to
that, like going to 12Gb of RAM, which certainly helped, so it's hard
to narrow it down too much.

> * Should I assume you tried worker but it uses too much CPU? If so, is
>   it a small percentage more or really really bad?

I don't use worker because it still dumps an un-backtraceable corefile
within about 5 minutes for me. I still have no idea why, though plenty
of corefiles. I haven't tried a serious analysis yet, because I've been
moving house, but I hope to get to it soon. Moving to worker would be a
good thing :)

If I get time, I'll compile the forensic logging module and see if I
can find if it has a request-specific trigger I can replicate for
testing. I have worker running for months on end on the exact same
platform with much lower yield rates, so I suspect the problem is only
being triggered by the sheer volume of requests.

-- 
Colm MacCárthaigh                        Public Key: [EMAIL PROTECTED]
[PATCH] raise MAX_SERVER_LIMIT
Not entirely serious, but today, we actually hit this, in production :)
The hardware, a dual 2Ghz Xeon with 12Gb RAM with Linux 2.6.1-rc2,
coped and remained responsive. So 20,000 may no longer be outside the
realms of what administrators reasonably desire to have.

Index: server/mpm/prefork/prefork.c
===================================================================
RCS file: /home/cvspublic/httpd-2.0/server/mpm/prefork/prefork.c,v
retrieving revision 1.286
diff -u -u -r1.286 prefork.c
--- server/mpm/prefork/prefork.c    1 Jan 2004 13:26:25 -0000    1.286
+++ server/mpm/prefork/prefork.c    7 Jan 2004 21:24:55 -0000
@@ -123,7 +123,7 @@
  * some sort of compile-time limit to help catch typos.
  */
 #ifndef MAX_SERVER_LIMIT
-#define MAX_SERVER_LIMIT 20000
+#define MAX_SERVER_LIMIT 100000
 #endif
 
 #ifndef HARD_THREAD_LIMIT

-- 
Colm MacCárthaigh                        Public Key: [EMAIL PROTECTED]