Re: [PATCH] Bug 2680: ** helper errors after -k rotate
tor 2009-07-16 klockan 11:13 +1000 skrev Robert Collins: Note that with NTLM this is realistic - there's no need for multiple helpers as a single helper can serve many concurrent requests. No longer the case as we no longer do challenge reuses and the concurrent protocol is not implemented for stateful helpers, but for basic, digest and url rewriters it is.. Regards Henrik
Re: [PATCH] Bug 2680: ** helper errors after -k rotate
tor 2009-07-16 klockan 14:47 +1200 skrev Amos Jeffries: And the sad reality in squid-3 with _one_ helper: squid-*--1-helper---* winbindd [state1, state2] \---crash: all 2 of 1 helpers all RS pending state. You can't run NTLM with just one helper, not until both the following are true: 1. We implement the current request protocol for stateful helpers 2. Samba implements the same in ntlm_auth (iirc this has been done in Samab4) The reason to this is because the challenge is unique for each client (required by NTLMv2) and the default NTLM helper protocol only supports one challenge at a time. But Squid should not crash horribly only because you do, only complain about helper queue overload... Regards Henrik
Re: [PATCH] Bug 2680: ** helper errors after -k rotate
On Thu, 2009-07-16 at 08:42 +0200, Henrik Nordstrom wrote: tor 2009-07-16 klockan 14:47 +1200 skrev Amos Jeffries: And the sad reality in squid-3 with _one_ helper: squid-*--1-helper---* winbindd [state1, state2] \---crash: all 2 of 1 helpers all RS pending state. You can't run NTLM with just one helper, not until both the following are true: 1. We implement the current request protocol for stateful helpers I thought we put a unique id on the stateful helper protocol _way_ back. Sigh :( 2. Samba implements the same in ntlm_auth (iirc this has been done in Samab4) We'd need to ask, I can do that if we think it hasn't been done. As samba supplies the auth helpers there shouldn't be a compatibility issue in changing it. -Rob signature.asc Description: This is a digitally signed message part
Re: [PATCH] Bug 2680: ** helper errors after -k rotate
tor 2009-07-16 klockan 16:54 +1000 skrev Robert Collins: I thought we put a unique id on the stateful helper protocol _way_ back. Sigh :( No, I couldn't when I implemented the concurent request protocol so it only got done for stateless helpers. iirc I got stuck in trying to unwind some of the dependencies in the reservation code used by the challenge reuse layer (which is thrown out these days..). Regards Henrik
[PATCH] Bug 2680: ** helper errors after -k rotate
Amos Jeffries wrote: Robert Collins wrote: On Wed, 2009-07-15 at 00:52 +0200, Henrik Nordstrom wrote: Or 4, go back to don't strictly enforce the number of helpers? +1 I don't know what this strictly-enforce thing is, but it sounds unneeded as we used to fire up the right number of helpers anyway. I stopped Squid saying: running: 200 of 100 XXX helpers A Henrik said, people with large memory-hog helpers have issues when Squid allocates more than N bunches of their carefully tuned available memory to its helpers. This is also important in low-memory systems requiring auth. It's a simple 'start N' call now checks the number of running helpers before blindly starting new ones. Making Squid actually follow its numerous children=N settings. I'm fine with reverting it in 3.1. But this is a nasty mix of sync and async operations that does need cleaning up in 3.2. It's semi-hiding about 4 bugs in a helpers and auth. Right, following up on that I had a hunch and discovered that n_active is already performing this duty. The bug is therefore in my earlier fix using n_running as a base instead of n_active. How does this new attached patch look to you two? It performs the calculation re-base and documents the relevant counters. Amos === modified file 'src/helper.cc' --- src/helper.cc 2009-06-14 12:35:17 + +++ src/helper.cc 2009-07-16 10:53:22 + @@ -103,7 +103,7 @@ shortname = xstrdup(progname); /* dont ever start more than hlp-n_to_start processes. */ -int need_new = hlp-n_to_start - hlp-n_running; +int need_new = hlp-n_to_start - hlp-n_active; debugs(84, 1, helperOpenServers: Starting need_new / hlp-n_to_start ' shortname ' processes); @@ -209,7 +209,8 @@ shortname = xstrdup(progname); /* dont ever start more than hlp-n_to_start processes. */ -int need_new = hlp-n_to_start - hlp-n_running; +/* n_active are the helpers which have not been shut down. */ +int need_new = hlp-n_to_start - hlp-n_active; debugs(84, 1, helperOpenServers: Starting need_new / hlp-n_to_start ' shortname ' processes); @@ -545,8 +546,8 @@ storeAppendPrintf(sentry, program: %s\n, hlp-cmdline-key); -storeAppendPrintf(sentry, number running: %d of %d\n, - hlp-n_running, hlp-n_to_start); +storeAppendPrintf(sentry, number active: %d of %d (%d shutting down)\n, + hlp-n_active, hlp-n_to_start, (hlp-n_running - hlp-n_active) ); storeAppendPrintf(sentry, requests sent: %d\n, hlp-stats.requests); storeAppendPrintf(sentry, replies received: %d\n, @@ -587,7 +588,7 @@ storeAppendPrintf(sentry,B = BUSY\n); storeAppendPrintf(sentry,W = WRITING\n); storeAppendPrintf(sentry,C = CLOSING\n); -storeAppendPrintf(sentry,S = SHUTDOWN\n); +storeAppendPrintf(sentry,S = SHUTDOWN PENDING\n); } void @@ -598,8 +599,8 @@ storeAppendPrintf(sentry, program: %s\n, hlp-cmdline-key); -storeAppendPrintf(sentry, number running: %d of %d\n, - hlp-n_running, hlp-n_to_start); +storeAppendPrintf(sentry, number active: %d of %d (%d shutting down)\n, + hlp-n_active, hlp-n_to_start, (hlp-n_running - hlp-n_active) ); storeAppendPrintf(sentry, requests sent: %d\n, hlp-stats.requests); storeAppendPrintf(sentry, replies received: %d\n, @@ -644,7 +645,7 @@ storeAppendPrintf(sentry,B = BUSY\n); storeAppendPrintf(sentry,C = CLOSING\n); storeAppendPrintf(sentry,R = RESERVED or DEFERRED\n); -storeAppendPrintf(sentry,S = SHUTDOWN\n); +storeAppendPrintf(sentry,S = SHUTDOWN PENDING\n); storeAppendPrintf(sentry,P = PLACEHOLDER\n); } @@ -671,7 +672,6 @@ hlp-n_active--; assert(hlp-n_active = 0); - srv-flags.shutdown = 1; /* request it to shut itself down */ if (srv-flags.closing) { === modified file 'src/helper.h' --- src/helper.h 2009-04-07 13:16:59 + +++ src/helper.h 2009-07-16 10:51:42 + @@ -58,9 +58,9 @@ dlink_list servers; dlink_list queue; const char *id_name; -int n_to_start; -int n_running; -int n_active; +int n_to_start; /// Configuration setting of how many helper children should be running +int n_running;/// Total helper children objects currently existing +int n_active; /// Count of helper children active (not shutting down) int ipc_type; IpAddress addr; unsigned int concurrency; @@ -80,9 +80,9 @@ dlink_list servers; dlink_list queue; const char *id_name; -int n_to_start; -int n_running; -int n_active; +int n_to_start; /// Configuration setting of how many helper children should be running +int n_running;/// Total helper children objects currently
Re: [PATCH] Bug 2680: ** helper errors after -k rotate
tor 2009-07-16 klockan 23:15 +1200 skrev Amos Jeffries: Right, following up on that I had a hunch and discovered that n_active is already performing this duty. The bug is therefore in my earlier fix using n_running as a base instead of n_active. How does this new attached patch look to you two? Looks fine to me, assuming rotate still starts a new set of helpers (wasn't clear to me). Will fix the too many helpers have crashed, need to start new ones case, and leave the rotate reconfigure case alone. With this we can also start new helpers more or less immediately (at some seconds delay just in case) when seeing them crash... Regards Henrik
Re: [PATCH] Bug 2680: ** helper errors after -k rotate
On Thu, 2009-07-16 at 09:11 +0200, Henrik Nordstrom wrote: tor 2009-07-16 klockan 16:54 +1000 skrev Robert Collins: I thought we put a unique id on the stateful helper protocol _way_ back. Sigh :( No, I couldn't when I implemented the concurent request protocol so it only got done for stateless helpers. iirc I got stuck in trying to unwind some of the dependencies in the reservation code used by the challenge reuse layer (which is thrown out these days..). Well thats hopeful, sounds like it would be easier now :) That reuse stuff was such a hack. -Rob signature.asc Description: This is a digitally signed message part
Re: [PATCH] Bug 2680: ** helper errors after -k rotate
Henrik Nordstrom wrote: tor 2009-07-16 klockan 23:15 +1200 skrev Amos Jeffries: Right, following up on that I had a hunch and discovered that n_active is already performing this duty. The bug is therefore in my earlier fix using n_running as a base instead of n_active. How does this new attached patch look to you two? Looks fine to me, assuming rotate still starts a new set of helpers (wasn't clear to me). Yes, no longer touching mainRotate() at all with these. Will fix the too many helpers have crashed, need to start new ones case, and leave the rotate reconfigure case alone. With this we can also start new helpers more or less immediately (at some seconds delay just in case) when seeing them crash... Ah, good point. :) Amos -- Please be using Current Stable Squid 2.7.STABLE6 or 3.0.STABLE16 Current Beta Squid 3.1.0.9
Re: [PATCH] Bug 2680: ** helper errors after -k rotate
Robert Collins wrote: On Wed, 2009-07-15 at 00:52 +0200, Henrik Nordstrom wrote: Or 4, go back to don't strictly enforce the number of helpers? +1 I don't know what this strictly-enforce thing is, but it sounds unneeded as we used to fire up the right number of helpers anyway. I stopped Squid saying: running: 200 of 100 XXX helpers A Henrik said, people with large memory-hog helpers have issues when Squid allocates more than N bunches of their carefully tuned available memory to its helpers. This is also important in low-memory systems requiring auth. It's a simple 'start N' call now checks the number of running helpers before blindly starting new ones. Making Squid actually follow its numerous children=N settings. I'm fine with reverting it in 3.1. But this is a nasty mix of sync and async operations that does need cleaning up in 3.2. It's semi-hiding about 4 bugs in a helpers and auth. Amos -- Please be using Current Stable Squid 2.7.STABLE6 or 3.0.STABLE16 Current Beta Squid 3.1.0.9
Re: [PATCH] Bug 2680: ** helper errors after -k rotate
Bundle Buggy has detected this merge request. For details, see: http://bundlebuggy.aaronbentley.com/project/squid/request/%3C4A584880.2080707%40treenet.co.nz%3E Project: Squid
Re: [PATCH] Bug 2680: ** helper errors after -k rotate
On Thu, 2009-07-16 at 14:08 +1200, Amos Jeffries wrote: Both reconfigure and helper recovery use startHelpers() where the limit needs to take place. The DOS bug fix broke *rotate* (reconfigure has an async step added by Alex that prevents it being a problem). s/rotate/reconfigure then :) In my mind one is a subset of the other. If someone is running hundreds of helpers on openwrt/olpc then things are broken already :). I'd really suggest that such environments pipeline through a single helper rather than many concurrent helpers. Such platorms are single core and you'll get better usage of memory doing many requests in a single helper than one request each to many helpers. lol, NTLM concurrent? try it! I did. IIRC the winbindd is fully capable of handling multiple overlapping requests, and each NTLM helper is *solely* a thunk layer between squid's format and the winbindd *state*. ASCII art time, 3 requests: Multiple helpers: /--1-helper--\ squid-*---2-helper---* winbindd [state1, state2, state3] \--3-helper--/ One helper: squid-*--1-helper---* winbindd [state1, state2, state3] -Rob signature.asc Description: This is a digitally signed message part
Re: [PATCH] Bug 2680: ** helper errors after -k rotate
NOte that winbind has a hard coded limit that is by default very low. Opening 2n ntlm_auth helpers may make things blow up in horrible ways. Adrian 2009/7/16 Robert Collins robe...@robertcollins.net: On Thu, 2009-07-16 at 14:08 +1200, Amos Jeffries wrote: Both reconfigure and helper recovery use startHelpers() where the limit needs to take place. The DOS bug fix broke *rotate* (reconfigure has an async step added by Alex that prevents it being a problem). s/rotate/reconfigure then :) In my mind one is a subset of the other. If someone is running hundreds of helpers on openwrt/olpc then things are broken already :). I'd really suggest that such environments pipeline through a single helper rather than many concurrent helpers. Such platorms are single core and you'll get better usage of memory doing many requests in a single helper than one request each to many helpers. lol, NTLM concurrent? try it! I did. IIRC the winbindd is fully capable of handling multiple overlapping requests, and each NTLM helper is *solely* a thunk layer between squid's format and the winbindd *state*. ASCII art time, 3 requests: Multiple helpers: /--1-helper--\ squid-*---2-helper---* winbindd [state1, state2, state3] \--3-helper--/ One helper: squid-*--1-helper---* winbindd [state1, state2, state3] -Rob
Re: [PATCH] Bug 2680: ** helper errors after -k rotate
On Thu, 16 Jul 2009 12:19:55 +1000, Robert Collins robe...@robertcollins.net wrote: On Thu, 2009-07-16 at 14:08 +1200, Amos Jeffries wrote: Both reconfigure and helper recovery use startHelpers() where the limit needs to take place. The DOS bug fix broke *rotate* (reconfigure has an async step added by Alex that prevents it being a problem). s/rotate/reconfigure then :) In my mind one is a subset of the other. If someone is running hundreds of helpers on openwrt/olpc then things are broken already :). I'd really suggest that such environments pipeline through a single helper rather than many concurrent helpers. Such platorms are single core and you'll get better usage of memory doing many requests in a single helper than one request each to many helpers. lol, NTLM concurrent? try it! I did. IIRC the winbindd is fully capable of handling multiple overlapping requests, and each NTLM helper is *solely* a thunk layer between squid's format and the winbindd *state*. ASCII art time, 3 requests: Multiple helpers: /--1-helper--\ squid-*---2-helper---* winbindd [state1, state2, state3] \--3-helper--/ One helper: squid-*--1-helper---* winbindd [state1, state2, state3] -Rob And the sad reality in squid-3 with _one_ helper: squid-*--1-helper---* winbindd [state1, state2] \---crash: all 2 of 1 helpers all RS pending state. But thats bug 2648. With a useful perl client script provided to replicate. Amos
Re: [PATCH] Bug 2680: ** helper errors after -k rotate
lör 2009-07-11 klockan 20:08 +1200 skrev Amos Jeffries: It's most visible in rotate because Squid is intended to keep running with a hot-swap of its logs. Previously the sequence was causing two full sets of helpers to be started, and a period of overlap before the async closure of the old set happened. Yes, this is what Squid-2 does. It schedules the existing set to be closed (no more new requests assigned to them), and starts a new set. The old set dies as soon as it's completed doing what it's doing at the moment.. (including statful associations). But now that we are enforcing the number of helpers started to prevent memory bloat the overlap prevents the new set from starting at all. Hmm.. why? haven't been much of a problem in all the years we have been doing this.. except possibly for some poorly configured SquidGuard setups with large text lists without db backends... The safety checks in helper itself which prevent live restart attempts are tuned to only do so on incremental helper deaths, not to wholesale 100% loss of helpers. Yes. Our options are: 1 don't shutdown helpers on rotate (this ha been another open bug) 2 make helpers code extremely aggressive to keep helpers running (at cost of being able to detect failures) 3 make mainRotate async Or 4, go back to don't strictly enforce the number of helpers? Is there something I've missed that means some helpers MUST be restarted on rotate? Helpers are restarted because their stderr is connected to cache.log in append mode. Regards Henrik
Re: [PATCH] Bug 2680: ** helper errors after -k rotate
On Wed, 2009-07-15 at 00:52 +0200, Henrik Nordstrom wrote: Or 4, go back to don't strictly enforce the number of helpers? +1 I don't know what this strictly-enforce thing is, but it sounds unneeded as we used to fire up the right number of helpers anyway. -Rob signature.asc Description: This is a digitally signed message part
[PATCH] Bug 2680: ** helper errors after -k rotate
http://www.squid-cache.org/bugs/show_bug.cgi?id=2680 I've tracked this one down to the helper shutdown being async and used from mainRotate() which is a sync operation. We are probably hitting the same issue on reconfigure and shutdown. It's most visible in rotate because Squid is intended to keep running with a hot-swap of its logs. Previously the sequence was causing two full sets of helpers to be started, and a period of overlap before the async closure of the old set happened. But now that we are enforcing the number of helpers started to prevent memory bloat the overlap prevents the new set from starting at all. The safety checks in helper itself which prevent live restart attempts are tuned to only do so on incremental helper deaths, not to wholesale 100% loss of helpers. Our options are: 1 don't shutdown helpers on rotate (this ha been another open bug) 2 make helpers code extremely aggressive to keep helpers running (at cost of being able to detect failures) 3 make mainRotate async (2) and (3) wil need to be done eventually, however as far as I can see there are only cons involved with keeping the status-quo of current rotate behavior: * extra work for Squid restarting stuff. * stateful helpers may be dropped mid-action leading to client errors. (from a simple log rotation!!) * longer time for the operation to complete back to a stable request processing state. Attached is a patch for those worst affected which removes the helpers from rotate sequence (option 1). I've checked and the only ways I can see at this point for helper to do any logging is via syslog or stderr relay back to Squid. Is there something I've missed that means some helpers MUST be restarted on rotate? Amos -- Please be using Current Stable Squid 2.7.STABLE6 or 3.0.STABLE16 Current Beta Squid 3.1.0.9 === modified file 'src/main.cc' --- src/main.cc 2009-06-28 08:11:27 + +++ src/main.cc 2009-07-11 07:31:13 + @@ -775,15 +775,6 @@ static void mainRotate(void) { -icmpEngine.Close(); -#if USE_DNSSERVERS - -dnsShutdown(); -#endif - -redirectShutdown(); -authenticateShutdown(); -externalAclShutdown(); _db_rotate_log(); /* cache.log */ storeDirWriteCleanLogs(1); storeLogRotate(); /* store.log */ @@ -791,19 +782,8 @@ useragentRotateLog(); /* useragent.log */ refererRotateLog(); /* referer.log */ #if WIP_FWD_LOG - fwdLogRotate(); #endif - -icmpEngine.Open(); -#if USE_DNSSERVERS - -dnsInit(); -#endif - -redirectInit(); -authenticateInit(Config.authConfiguration); -externalAclInit(); } static void