Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.
> On 2 Feb 2018, at 14:19, Jim Jagielski wrote:
>
> To be honest, I don't think we ever envisioned an actual environment
> where the config files change every hour and the server is gracefully
> restarted... I think our working assumptions have been that actual
> config file changes are "rare", hence the number of modules that
> allow for "on-the-fly" reconfiguration, which avoids the need for
> restarts.
>
> So this is a nice "edge case"

Think mass hosting of 1+ reverse-proxy front-ends, all in the same Apache instance, with self-service updates to configs as well as a staging environment.

The 24-hour cycle is like this: 1am, full stop (SIGTERM) and start of Apache with all configurations, primarily to permit log file rotation. Then, on the hour, any configuration changes requested are made live by auto-generation of a giant 200k+ configuration, followed by a HUP (not a USR1) signal to keep the same parent but spawn a bunch of fresh children. As these are mostly reverse proxies, we generate thousands of balancer and balancermember directives per configuration. Once a minute, a background process checks for responses and forcibly restarts Apache (SIGTERM, then SIGKILL if necessary) if it doesn't respond.

Finally, bear in mind that line number changes can occur merely because a new virtualhost was added ahead of a given virtualhost, so some kind of tracking UUID for a virtualhost, based on whatever non-line-number properties are available, is probably useful.

- Mark
Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.
To be honest, I don't think we ever envisioned an actual environment where the config files change every hour and the server is gracefully restarted... I think our working assumptions have been that actual config file changes are "rare", hence the number of modules that allow for "on-the-fly" reconfiguration, which avoids the need for restarts.

So this is a nice "edge case"

> On Feb 1, 2018, at 11:49 AM, Mark Blackman wrote:
>
>> On 1 Feb 2018, at 16:27, Yann Ylavic wrote:
>>
>>> On Thu, Feb 1, 2018 at 5:15 PM, Yann Ylavic wrote:
>>>> On Thu, Feb 1, 2018 at 4:32 PM, Mark Blackman wrote:
>>>>
>>>>> SHM clean-up is the key here and any patch that doesn’t contribute to
>>>>> that has no immediate value for me.
>>>
>>> What you may want to try is remove "s->defn_line_number" from the id there:
>>> https://github.com/apache/httpd/blob/trunk/modules/proxy/mod_proxy_balancer.c#L787
>>> If your configuration file changes often, that contributes to changing
>>> the name of the SHM...
>>
>> FWIW, here is (attached) the patch I'm thinking about.
>
> Thanks, the configuration changes once an hour or so. Typically, we have
> about 1000 active shared memory segments (yes, they are SHMs) attached to the
> httpd processes.
>
> For now, we’ll just have to implement a SHM clean-up in the start/stop
> wrappers until we can address the root cause or find a cleaner mitigation,
> which your patch might help with.
>
> - Mark
Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.
FWIW, the id is supposed to be somewhat unique if the config DOES change, hence the use of the line number as part of the hash... In other words, if the config file itself is changed, we want to create a new id, because we have no idea how to match up the "old" config in shm and the "new" config that was just reloaded, so we assume that the new config is the new default and thus deserves/requires a new ID.

> On Feb 1, 2018, at 11:27 AM, Yann Ylavic wrote:
Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.
> On 1 Feb 2018, at 16:27, Yann Ylavic wrote:
>
>> On Thu, Feb 1, 2018 at 5:15 PM, Yann Ylavic wrote:
>>> On Thu, Feb 1, 2018 at 4:32 PM, Mark Blackman wrote:
>>>
>>>> SHM clean-up is the key here and any patch that doesn’t contribute to
>>>> that has no immediate value for me.
>>
>> What you may want to try is remove "s->defn_line_number" from the id there:
>> https://github.com/apache/httpd/blob/trunk/modules/proxy/mod_proxy_balancer.c#L787
>> If your configuration file changes often, that contributes to changing
>> the name of the SHM...
>
> FWIW, here is (attached) the patch I'm thinking about.

Thanks, the configuration changes once an hour or so. Typically, we have about 1000 active shared memory segments (yes, they are SHMs) attached to the httpd processes.

For now, we’ll just have to implement a SHM clean-up in the start/stop wrappers until we can address the root cause or find a cleaner mitigation, which your patch might help with.

- Mark
Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.
On Thu, Feb 1, 2018 at 5:15 PM, Yann Ylavic wrote:
> On Thu, Feb 1, 2018 at 4:32 PM, Mark Blackman wrote:
>
>> SHM clean-up is the key here and any patch that doesn’t contribute to
>> that has no immediate value for me.
>
> What you may want to try is remove "s->defn_line_number" from the id there:
> https://github.com/apache/httpd/blob/trunk/modules/proxy/mod_proxy_balancer.c#L787
> If your configuration file changes often, that contributes to changing
> the name of the SHM...

FWIW, here is (attached) the patch I'm thinking about.

Index: modules/proxy/mod_proxy_balancer.c
===================================================================
--- modules/proxy/mod_proxy_balancer.c	(revision 1822878)
+++ modules/proxy/mod_proxy_balancer.c	(working copy)
@@ -784,13 +784,12 @@ static int balancer_post_config(apr_pool_t *pconf,
          * During create_proxy_config() we created a dummy id. Now that
          * we have identifying info, we can create the real id
          */
-        id = apr_psprintf(pconf, "%s.%s.%d.%s.%s.%u.%s",
+        id = apr_psprintf(pconf, "%s.%s.%d.%s.%s.%s",
                           (s->server_scheme ? s->server_scheme : ""),
                           (s->server_hostname ? s->server_hostname : "???"),
                           (int)s->port,
                           (s->server_admin ? s->server_admin : "??"),
                           (s->defn_name ? s->defn_name : "?"),
-                          s->defn_line_number,
                           (s->error_fname ? s->error_fname : DEFAULT_ERRORLOG));
         conf->id = apr_psprintf(pconf, "p%x",
                                 ap_proxy_hashfunc(id, PROXY_HASHFUNC_DEFAULT));
Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.
On Thu, Feb 1, 2018 at 4:32 PM, Mark Blackman wrote:
>
>> On 1 Feb 2018, at 12:36, Yann Ylavic wrote:
>>
>> Hi Mark,
>>
>> On Thu, Feb 1, 2018 at 10:29 AM, Mark Blackman wrote:
>>>
>>> Just to confirm, you expect that patch to handle SHM clean-up
>>> even in the “nasty error” case?
>>
>> Not really, no patch can avoid a crash for crashing code :/ The
>> "stop_signals-PR61558.patch" patch avoids a known httpd crash in
>> some circumstances, but...
>
> Well, I just mean: if sig_coredump gets called, will the patch result
> in the normal SHM clean-up routines getting called, where they would
> not have been called before?

No, unfortunately nothing fancy in there; keep in mind that it's a root process faulting, so I don't think much can be done...

> SHM clean-up is the key here and any patch that doesn’t contribute to
> that has no immediate value for me.

What you may want to try is remove "s->defn_line_number" from the id there:
https://github.com/apache/httpd/blob/trunk/modules/proxy/mod_proxy_balancer.c#L787
If your configuration file changes often, that contributes to changing the name of the SHM...

>>> I suspect that nasty error is triggered by the Weblogic plugin
>>> based on the adjacency in the logs, but the tracing doesn’t
>>> reveal any details, so an strace will probably be required to get
>>> more detail.
>
> Tracing has confirmed this really is a segmentation fault despite the
> lack of host-level messages, that reading a 3rd-party module (but
> not Weblogic) is the last thing that happens before the segmentation
> fault, and that pattern is fairly consistent. Now we need to ensure
> coredumps are generated.
>
> Finally, there are no orphaned child httpd processes with a PPID of
> 1. Just thousands and thousands of SHM segments with no processes
> attached to them.

Which brings us back to why attach and/or create fail if nothing is attached to them. These are SHMs (per "ipcs -m"), right? Not semaphores ("ipcs -s")?
"Thousands and thousands" is kind of exponential, even for thousands of vhosts; do the names of the SHMs change on each startup? (Besides the generation number if you use that patch; I'm hardly thinking that the processes would crash arbitrarily at generation [0..1000]...) If so, does it relate to configuration changes?

We are not talking about fixing the root issue here :/

Regards,
Yann.
Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.
> On 1 Feb 2018, at 12:36, Yann Ylavic wrote:
>
> Hi Mark,
>
> On Thu, Feb 1, 2018 at 10:29 AM, Mark Blackman wrote:
>>
>> Just to confirm, you expect that patch to handle SHM clean-up even in
>> the “nasty error” case?
>
> Not really, no patch can avoid a crash for crashing code :/
> The "stop_signals-PR61558.patch" patch avoids a known httpd crash in
> some circumstances, but...

Well, I just mean: if sig_coredump gets called, will the patch result in the normal SHM clean-up routines getting called, where they would not have been called before?

SHM clean-up is the key here and any patch that doesn’t contribute to that has no immediate value for me.

>> I suspect that nasty error is triggered by
>> the Weblogic plugin based on the adjacency in the logs, but the
>> tracing doesn’t reveal any details, so an strace will probably be
>> required to get more detail.

Tracing has confirmed this really is a segmentation fault despite the lack of host-level messages, that reading a 3rd-party module (but not Weblogic) is the last thing that happens before the segmentation fault, and that pattern is fairly consistent. Now we need to ensure coredumps are generated.

Finally, there are no orphaned child httpd processes with a PPID of 1. Just thousands and thousands of SHM segments with no processes attached to them.

Regards,
Mark
Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.
Hi Mark,

On Thu, Feb 1, 2018 at 10:29 AM, Mark Blackman wrote:
>
> Thanks, for now, we will treat the “nasty error” as a separate
> question to resolve and hope that clean-up patch deals with the
> immediate issue.

OK, that patch can be discussed on bz if it doesn't turn too technical. Technical (long) discussions and debugging are not very friendly for future visitors of bz who may encounter the same issue and want to get straight to the solution...

> I had originally treated that “nasty error” as a reference to the
> “file exists” error. However, based on your feedback and reviewing
> the logs, I would conclude that “nasty error” is the trigger, as you
> suggest, and the lack of SHM clean-up and consequent collisions are
> collateral damage.

That's what I feel, but I wouldn't stake my life on it either :)

> Just to confirm, you expect that patch to handle SHM clean-up even in
> the “nasty error” case?

Not really, no patch can avoid a crash for crashing code :/
The "stop_signals-PR61558.patch" patch avoids a known httpd crash in some circumstances, but...

> I suspect that nasty error is triggered by
> the Weblogic plugin based on the adjacency in the logs, but the
> tracing doesn’t reveal any details, so an strace will probably be
> required to get more detail.

... if the crash is not related, that won't help.

I'm missing something in your scenario though. In the original/non-patched code, and still with the "generation number" patch (aka "Jim's"), there is always an attempt to attach the SHM first, and only if that fails is a new one created. It means that even if the parent process crashes without cleaning up the SHM on the system, whether or not some children are still alive when a new httpd instance is started, it should be able to attach the SHM (create would fail, but not attach).
Btw, things would probably turn bad sooner or later anyway, because synchronization assumptions are off: old and new children wouldn't share the same mutex, which is not reused/attached on startup (global mutexes leak in the system in that scenario even more than SHMs do). So why do both attach and create fail in your case?

With my proposed patch (r1822509), since I removed attach (bullet 4/ in the commit message), your scenario is "expected" to fail when the second httpd instance starts (while old children are still alive). I'm not sure I should fix this (re-introduce the attach code) because, as I said, this is a screwy scenario with regard to the global mutex; it's not supposed to work like this.

The only sane thing to do here (IMHO, and more a note to other httpd devs) would be to kill children whenever the parent process dies underneath them, be it with a startup script (there shouldn't be any orphaned child process, at least when httpd starts), or natively in the MPM, which could detect this situation (that's another story though, and it probably should be opt-in because it depends on how httpd is started/monitored externally, and how much the user wants the service to continue as much as possible...).

So the faster/simpler solution *for you* might be to create/modify your (re)startup script such that it kills orphaned children, if any, as a precaution...

> Bugzilla was slightly easier to get log data into as I cannot use
> work email for these conversations.

There is no strong statement/rule on bz vs dev@; if it's more convenient for you to continue there, that is a good reason ;)
I wouldn't go as far in the discussion as I did here, though (sorry if it was too long btw).

Regards,
Yann.
Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.
On 31 Jan 2018, at 22:41, Yann Ylavic wrote:
>
> Hi Mark,
>
> let's continue this debugging on dev@ if you don't mind..
>
>> On Wed, Jan 31, 2018 at 10:15 PM, wrote:
>> https://bz.apache.org/bugzilla/show_bug.cgi?id=62044
>>
>> --- Comment #32 from m...@blackmans.org ---
>> so sig_coredump is being triggered by an unknown signal, multiple times a day.
>> It's not a segfault, nothing in /var/log/messages. That results in a bunch of
>> undeleted shared memory segments and probably some that will no longer be in
>> the global list, but still present in the kernel.
>
> In 2.4.29, i.e. without patch [1], sig_coredump might be triggered by
> any signal received by httpd during a restart, and the signal handler
> crashes itself (double fault) so the process is forcibly SIGKILLed
> (presumably, hence no trace in /var/log/messages...).
> This was reported and discussed in [2], and seems to correspond quite
> well to what you observe in your tests.
>
> Moreover, if the parent process crashes, nothing will delete the
> IPC-SysV SHMs (hence the leak in the system), while children processes
> may continue to be attached, which prevents a new parent process from
> starting (until the children stop or are forcibly killed)...
>
> When this happens, you should see non-root processes attached to PPID
> 1 (e.g. with "ps -ef"); "-f /path/to/httpd.conf" in the command line
> might help distinguish the different httpd instances to monitor
> processes.
>
> If this is the case, you probably should try patch [1].
> If not, I can't explain why in the httpd logs a process with a different
> PID appears after the SIGHUP; it must have been started
> (automatically?) after the previous one crashed.
> The generation number can't help here, a new process always starts at
> generation #0.
>
> Regards,
> Yann.
>
> [1] https://svn.apache.org/repos/asf/httpd/httpd/patches/2.4.x/stop_signals-PR61558.patch
> [2] https://bz.apache.org/bugzilla/show_bug.cgi?id=61558

Thanks, for now, we will treat the “nasty error” as a separate question to resolve and hope that clean-up patch deals with the immediate issue.

I had originally treated that “nasty error” as a reference to the “file exists” error. However, based on your feedback and reviewing the logs, I would conclude that “nasty error” is the trigger, as you suggest, and the lack of SHM clean-up and consequent collisions are collateral damage.

Just to confirm, you expect that patch to handle SHM clean-up even in the “nasty error” case? I suspect that nasty error is triggered by the Weblogic plugin based on the adjacency in the logs, but the tracing doesn’t reveal any details, so an strace will probably be required to get more detail.

Bugzilla was slightly easier to get log data into as I cannot use work email for these conversations.

Cheers,
Mark
Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.
Hi Mark,

let's continue this debugging on dev@ if you don't mind..

On Wed, Jan 31, 2018 at 10:15 PM, wrote:
> https://bz.apache.org/bugzilla/show_bug.cgi?id=62044
>
> --- Comment #32 from m...@blackmans.org ---
> so sig_coredump is being triggered by an unknown signal, multiple times a day.
> It's not a segfault, nothing in /var/log/messages. That results in a bunch of
> undeleted shared memory segments and probably some that will no longer be in
> the global list, but still present in the kernel.

In 2.4.29, i.e. without patch [1], sig_coredump might be triggered by any signal received by httpd during a restart, and the signal handler crashes itself (double fault) so the process is forcibly SIGKILLed (presumably, hence no trace in /var/log/messages...).
This was reported and discussed in [2], and seems to correspond quite well to what you observe in your tests.

Moreover, if the parent process crashes, nothing will delete the IPC-SysV SHMs (hence the leak in the system), while children processes may continue to be attached, which prevents a new parent process from starting (until the children stop or are forcibly killed)...

When this happens, you should see non-root processes attached to PPID 1 (e.g. with "ps -ef"); "-f /path/to/httpd.conf" in the command line might help distinguish the different httpd instances to monitor processes.

If this is the case, you probably should try patch [1].
If not, I can't explain why in the httpd logs a process with a different PID appears after the SIGHUP; it must have been started (automatically?) after the previous one crashed.
The generation number can't help here, a new process always starts at generation #0.

Regards,
Yann.

[1] https://svn.apache.org/repos/asf/httpd/httpd/patches/2.4.x/stop_signals-PR61558.patch
[2] https://bz.apache.org/bugzilla/show_bug.cgi?id=61558