Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.

2018-02-02 Thread Mark Blackman


> On 2 Feb 2018, at 14:19, Jim Jagielski  wrote:
> 
> To be honest, I don't think we ever envisioned an actual environ
> where the config files change every hour and the server gracefully
> restarted... I think our working assumptions have been that actual
> config file changes are "rare", hence the number of modules that
> allow for "on-the-fly" reconfiguration which avoid the need for
> restarts.
> 
> So this is a nice "edge case"
> 

Think mass hosting of 1+ reverse-proxy front-ends all in the same Apache 
instance, with self-service updates to configs as well as a staging 
environment. The 24-hour cycle is like this:

1am: Full stop (SIGTERM) and start of Apache with all configurations, primarily 
to permit log file rotation.

Then, on the hour, any requested configuration changes are made live by 
auto-generating a giant 200k+ configuration, then sending a HUP (not a USR1) 
signal to keep the same parent but get a bunch of fresh children. As these are 
mostly reverse proxies, we generate thousands of balancer and BalancerMember 
directives per configuration.
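For scale, a single generated balancer might look like the fragment below (names and addresses are invented for illustration); thousands of these blocks are emitted per configuration:

```apache
# One auto-generated reverse-proxy balancer (hypothetical names/addresses)
<Proxy "balancer://app42">
    BalancerMember "http://10.0.0.11:8080"
    BalancerMember "http://10.0.0.12:8080"
</Proxy>
ProxyPass "/app42/" "balancer://app42/"
```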

In the background, once a minute, a process checks that Apache responds and 
forcibly restarts it (SIGTERM, then SIGKILL if necessary) if it doesn’t.
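That watchdog logic can be sketched roughly as follows (the pid-file path and health-check URL are assumptions, not our actual setup):

```shell
#!/bin/sh
# Watchdog sketch: run a health check; if it fails, SIGTERM the process,
# wait a grace period, then SIGKILL it if it is still alive.
watchdog() {
    check_cmd=$1; pid=$2; grace=${3:-10}
    if $check_cmd >/dev/null 2>&1; then
        return 0                    # healthy, nothing to do
    fi
    kill -TERM "$pid" 2>/dev/null   # polite stop first
    sleep "$grace"
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid"           # still alive after grace: force it
    fi
    return 0
}

# e.g. from cron, once a minute (path and URL are assumptions):
# watchdog "curl -fsS -m 5 http://localhost/" "$(cat /var/run/httpd.pid)"
```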

Finally, bear in mind that line numbers can change merely because a new 
virtualhost was added ahead of a given virtualhost, so some kind of tracking 
UUID for a virtualhost, based on its non-line-number properties, is probably 
useful.

- Mark




Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.

2018-02-02 Thread Jim Jagielski
To be honest, I don't think we ever envisioned an actual environ
where the config files change every hour and the server gracefully
restarted... I think our working assumptions have been that actual
config file changes are "rare", hence the number of modules that
allow for "on-the-fly" reconfiguration which avoid the need for
restarts.

So this is a nice "edge case"

> On Feb 1, 2018, at 11:49 AM, Mark Blackman  wrote:
> 
> 
> 
>> On 1 Feb 2018, at 16:27, Yann Ylavic  wrote:
>> 
>>> On Thu, Feb 1, 2018 at 5:15 PM, Yann Ylavic  wrote:
>>> On Thu, Feb 1, 2018 at 4:32 PM, Mark Blackman  wrote:
>>> 
 SHM clean-up is the key here and any patch that doesn’t contribute to
 that has no immediate value for me.
>>> 
>>> What you may want to try is remove "s->defn_line_number" from the id there:
>>> https://github.com/apache/httpd/blob/trunk/modules/proxy/mod_proxy_balancer.c#L787
>>> If your configuration file changes often, that contributes to changing
>>> the name of the SHM...
>> 
>> FWIW, here is (attached) the patch I'm thinking about.
>> 
> 
> Thanks, the configuration changes once an hour or so. Typically, we have 
> about 1000 active shared memory segments (yes, they are SHMs) attached to the 
> httpd processes.
> 
> For now, we’ll just have to implement a SHM clean-up in the start/stop 
> wrappers until we can address the root cause or find a cleaner mitigation, 
> which your patch might help with.
> 
> - Mark



Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.

2018-02-02 Thread Jim Jagielski
FWIW, the id is supposed to be somewhat unique if the config DOES
change, hence the use of the line number as part of the hash...
In other words, if the config file itself is changed, we want to
create a new id because we have no idea how to match up
the "old" config in shm and the "new" config that was just
reloaded, so we assume that the new config is the new
default and thus deserves/requires a new ID.
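To illustrate this, here is a rough sketch (using md5sum purely as a stand-in for httpd's ap_proxy_hashfunc, with an invented vhost): since the line number is part of the hashed id, merely shifting a vhost down one line produces a different id, and hence a different SHM name:

```shell
# id format (per mod_proxy_balancer.c): scheme.hostname.port.admin.defn_name.line.errorlog
# The two ids below differ only in the defn_line_number component.
id_line100="http.example.com.80.admin@example.com.conf/httpd.conf.100.logs/error_log"
id_line101="http.example.com.80.admin@example.com.conf/httpd.conf.101.logs/error_log"

# md5sum stands in for ap_proxy_hashfunc here; a one-line shift in the
# config yields a different hash, and thus a different SHM name.
h1=$(printf '%s' "$id_line100" | md5sum | cut -d' ' -f1)
h2=$(printf '%s' "$id_line101" | md5sum | cut -d' ' -f1)
echo "$h1"
echo "$h2"
```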

> On Feb 1, 2018, at 11:27 AM, Yann Ylavic  wrote:
> 
> 



Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.

2018-02-01 Thread Mark Blackman


> On 1 Feb 2018, at 16:27, Yann Ylavic  wrote:
> 
>> On Thu, Feb 1, 2018 at 5:15 PM, Yann Ylavic  wrote:
>> On Thu, Feb 1, 2018 at 4:32 PM, Mark Blackman  wrote:
>> 
>>> SHM clean-up is the key here and any patch that doesn’t contribute to
>>> that has no immediate value for me.
>> 
>> What you may want to try is remove "s->defn_line_number" from the id there:
>> https://github.com/apache/httpd/blob/trunk/modules/proxy/mod_proxy_balancer.c#L787
>> If your configuration file changes often, that contributes to changing
>> the name of the SHM...
> 
> FWIW, here is (attached) the patch I'm thinking about.
> 

Thanks, the configuration changes once an hour or so. Typically, we have about 
1000 active shared memory segments (yes, they are SHMs) attached to the httpd 
processes.

For now, we’ll just have to implement a SHM clean-up in the start/stop wrappers 
until we can address the root cause or find a cleaner mitigation, which your 
patch might help with.
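Such a clean-up step for a stop wrapper might look like this (a sketch, assuming Linux `ipcs` output format and that httpd children run as user "apache"): remove SysV SHM segments owned by that user with no process attached (nattch == 0):

```shell
#!/bin/sh
# ipcs -m data columns on Linux: key shmid owner perms bytes nattch status
HTTPD_USER=apache   # assumption: the user httpd children run as

orphaned_shmids() {
    # Filter ipcs-style lines: segments owned by $1 with nattch == 0.
    awk -v user="$1" '$3 == user && $6 == 0 { print $2 }'
}

# Delete every orphaned segment; a no-op if none are found.
ipcs -m | orphaned_shmids "$HTTPD_USER" |
while read -r shmid; do
    ipcrm -m "$shmid"
done
```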

- Mark


Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.

2018-02-01 Thread Yann Ylavic
On Thu, Feb 1, 2018 at 5:15 PM, Yann Ylavic  wrote:
> On Thu, Feb 1, 2018 at 4:32 PM, Mark Blackman  wrote:
>
>> SHM clean-up is the key here and any patch that doesn’t contribute to
>> that has no immediate value for me.
>
> What you may want to try is remove "s->defn_line_number" from the id there:
>  
> https://github.com/apache/httpd/blob/trunk/modules/proxy/mod_proxy_balancer.c#L787
> If your configuration file changes often, that contributes to changing
> the name of the SHM...

FWIW, here is (attached) the patch I'm thinking about.
Index: modules/proxy/mod_proxy_balancer.c
===================================================================
--- modules/proxy/mod_proxy_balancer.c	(revision 1822878)
+++ modules/proxy/mod_proxy_balancer.c	(working copy)
@@ -784,13 +784,12 @@ static int balancer_post_config(apr_pool_t *pconf,
          * During create_proxy_config() we created a dummy id. Now that
          * we have identifying info, we can create the real id
          */
-        id = apr_psprintf(pconf, "%s.%s.%d.%s.%s.%u.%s",
+        id = apr_psprintf(pconf, "%s.%s.%d.%s.%s.%s",
                           (s->server_scheme ? s->server_scheme : ""),
                           (s->server_hostname ? s->server_hostname : "???"),
                           (int)s->port,
                           (s->server_admin ? s->server_admin : "??"),
                           (s->defn_name ? s->defn_name : "?"),
-                          s->defn_line_number,
                           (s->error_fname ? s->error_fname : DEFAULT_ERRORLOG));
         conf->id = apr_psprintf(pconf, "p%x",
                                 ap_proxy_hashfunc(id, PROXY_HASHFUNC_DEFAULT));


Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.

2018-02-01 Thread Yann Ylavic
On Thu, Feb 1, 2018 at 4:32 PM, Mark Blackman  wrote:
>> On 1 Feb 2018, at 12:36, Yann Ylavic  wrote:
>>
>> Hi Mark,
>>
>> On Thu, Feb 1, 2018 at 10:29 AM, Mark Blackman  wrote:
>>>
>>>
>>> Just to confirm, you expect that patch to handle SHM clean-up
>>> even in the “nasty error” case?
>>
>> Not really, no patch can avoid a crash for a crashing code :/ The
>> "stop_signals-PR61558.patch" patch avoids a known httpd crash in
>> some circumstances, but...
>
> Well, I just mean, if sig_coredump gets called, will the patch result
> in the normal SHM clean-up routines getting called, where they would
> have not been called before?

No, unfortunately nothing fancy in there; keep in mind that it's a
root process faulting, so I don't think much can be done...

> SHM clean-up is the key here and any patch that doesn’t contribute to
> that has no immediate value for me.

What you may want to try is remove "s->defn_line_number" from the id there:
 
https://github.com/apache/httpd/blob/trunk/modules/proxy/mod_proxy_balancer.c#L787
If your configuration file changes often, that contributes to changing
the name of the SHM...

>
>>
>>> I suspect that nasty error is triggered by the Weblogic plugin
>>> based on the adjacency in the logs, but the tracing doesn’t
>>> reveal any details, so an strace will probably be required to get
>>> more detail.
>
> Tracing has confirmed this really is a segmentation fault despite the
> lack of host-level messages and that reading a 3rd party module (but
> not Weblogic) is the last thing that happens before the segmentation
> fault and that pattern is fairly consistent. Now we need to ensure
> coredumps are generated.
>
> Finally, there are no orphaned child httpd processes with a PPID of
> 1.  Just thousands and thousands of SHM segments with no processes
> attached to them.

Which brings us back to why attach and/or create fail if nothing is
attached to them.
These are SHMs (per "ipcs -m"), right? Not semaphores ("ipcs -s")?

"thousands and thousands" is a lot, even for thousands of vhosts; do
the names of the SHMs change on each startup?
(besides the generation number if you use that patch; I can hardly
imagine the processes crashing arbitrarily at generation
[0..1000]...)
If so, does it relate to configuration changes?

We are not talking about fixing the root issue here :/


Regards,
Yann.


Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.

2018-02-01 Thread Mark Blackman

> On 1 Feb 2018, at 12:36, Yann Ylavic  wrote:
> 
> Hi Mark,
> 
> On Thu, Feb 1, 2018 at 10:29 AM, Mark Blackman  wrote:
>> 
>> 
>> Just to confirm, you expect that patch to handle SHM clean-up even in
>> the “nasty error” case?
> 
> Not really, no patch can avoid a crash for a crashing code :/
> The "stop_signals-PR61558.patch" patch avoids a known httpd crash in
> some circumstances, but...

Well, I just mean, if sig_coredump gets called, will the patch result in the 
normal SHM clean-up routines getting called, where they would have not been 
called before?  SHM clean-up is the key here and any patch that doesn’t 
contribute to that has no immediate value for me.

> 
>> I suspect that nasty error is triggered by
>> the Weblogic plugin based on the adjacency in the logs, but the
>> tracing doesn’t reveal any details, so an strace will probably be
>> required to get more detail.

Tracing has confirmed this really is a segmentation fault despite the lack of 
host-level messages and that reading a 3rd party module (but not Weblogic) is 
the last thing that happens before the segmentation fault and that pattern is 
fairly consistent. Now we need to ensure coredumps are generated.

Finally, there are no orphaned child httpd processes with a PPID of 1.  Just 
thousands and thousands of SHM segments with no processes attached to them.

Regards,
Mark


Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.

2018-02-01 Thread Yann Ylavic
Hi Mark,

On Thu, Feb 1, 2018 at 10:29 AM, Mark Blackman  wrote:
> Thanks, for now, we will treat the “nasty error” as a separate
> question to resolve and hope that clean-up patch deals with the
> immediate issue.

OK, that patch can be discussed on bz if it doesn't turn too technical.
Long technical discussions and debugging are not very friendly for
future visitors of bz who may encounter the same issue and just want
to get to the solution...

>
> I had originally treated that “nasty error” as a reference to the
> “file exists” error. However, based on your feedback and reviewing
> the logs, I would conclude that “nasty error” is the trigger, as you
> suggest, and the lack of SHM clean-up and consequent collisions are
> collateral damage.

That's what I feel, but I wouldn't stake my life on it either :)

>
> Just to confirm, you expect that patch to handle SHM clean-up even in
> the “nasty error” case?

Not really, no patch can avoid a crash for a crashing code :/
The "stop_signals-PR61558.patch" patch avoids a known httpd crash in
some circumstances, but...

> I suspect that nasty error is triggered by
> the Weblogic plugin based on the adjacency in the logs, but the
> tracing doesn’t reveal any details, so an strace will probably be
> required to get more detail.

... if the crash is not related, that won't help.

I'm missing something in your scenario though.

In the original/non-patched code, and still with the "generation
number" patch (aka "Jim's"), there is always an attempt to attach the
SHM first, and only if that fails is a new one created.
It means that even if the parent process crashes without cleaning up
the SHM on the system, whether or not some children are still alive
when a new httpd instance is started, it should be able to attach the
SHM (create would fail, but not attach).
Btw, things would probably turn bad sooner or later because
synchronization assumptions are off (old and new children wouldn't
share the same mutex, which is not reused/attached on startup; global
mutexes leak in the system in that scenario even more than SHMs).
So why do both attach and create fail in your case?

With my proposed patch (r1822509), since I removed attach (bullet 4/
in the commit message), your scenario is "expected" to fail when the
second httpd instance starts (while old children are still alive).
I'm not sure I should fix this (re-introduce the attach code) because
as I said this is a screwy scenario with regard to the global mutex,
it's not supposed to work like this.
The only sane thing to do here (IMHO, and more a note to other httpd
devs) would be to kill children whenever the parent process dies
underneath them, be it with a startup script (there shouldn't be any
orphaned child process, at least when httpd starts), or natively in
the MPM which could detect this situation (that's another story
though, and it probably should be opt-in because it depends on how
httpd is started/monitored externally, and how much the user wants the
service to continue as much as possible...).

So the faster/simpler solution *for you* might be to create/modify
your (re)startup script so that it kills orphaned children, if any,
as a precaution...

>
> Bugzilla was slightly easier to get log data into as I cannot use
> work email for these conversations.

There is no strong statement/rule on bz vs dev@, if it's more
convenient for you to continue there this is a good reason ;)
I wouldn't go as far in the discussion as I did here, though (sorry if
it was too long btw).


Regards,
Yann.


Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.

2018-02-01 Thread Mark Blackman
On 31 Jan 2018, at 22:41, Yann Ylavic  wrote:
> 
> Hi Mark,
> 
> let's continue this debugging on dev@ if you don't mind..
> 
>> On Wed, Jan 31, 2018 at 10:15 PM,   wrote:
>> https://bz.apache.org/bugzilla/show_bug.cgi?id=62044
>> 
>> --- Comment #32 from m...@blackmans.org ---
>> so sig_coredump is being triggered by an unknown signal, multiple times a 
>> day.
>> It's not a segfault, nothing in /var/log/messages. That results in a bunch of
>> undeleted shared memory segments and probably some that will no longer be in
>> the global list, but still present in the kernel.
> 
> In 2.4.29, i.e. without patch [1], sig_coredump might be triggered by
> any signal received by httpd during a restart, and the signal handler
> crashes itself (double fault), so the process is forcibly SIGKILLed
> (presumably, no trace in /var/log/messages...).
> This was reported and discussed in [2], and seems to correspond quite
> well to what you observe in your tests.
> 
> Moreover, if the parent process crashes nothing will delete the
> IPC-SysV SHMs (hence the leak in the system), while children processes
> may continue to be attached which prevents a new parent process to
> start (until children stop or are forcibly killed)...
> 
> When this happens, you should see non-root processes attached to PPID
> 1 (e.g. with "ps -ef"), "-f /path/to/httpd.conf" in the command line
> might help distinguish the different httpd instances to monitor
> processes.
> 
> If this is the case, you probably should try patch [1].
> If not, I can't explain why in httpd logs a process with a different
> PID appears after the SIGHUP, it must have been started
> (automatically?) after the previous one crashed.
> Here the generation number can't help, a new process always starts at
> generation #0.
> 
> Regards,
> Yann.
> 
> [1] 
> https://svn.apache.org/repos/asf/httpd/httpd/patches/2.4.x/stop_signals-PR61558.patch
> [2] https://bz.apache.org/bugzilla/show_bug.cgi?id=61558

Thanks, for now, we will treat the “nasty error” as a separate question to 
resolve and hope that clean-up patch deals with the immediate issue.

I had originally treated that “nasty error” as a reference to the “file exists” 
error.  However, based on your feedback and reviewing the logs, I would 
conclude that “nasty error” is the trigger, as you suggest, and the lack of 
SHM clean-up and consequent collisions are collateral damage.

Just to confirm, you expect that patch to handle SHM clean-up even in the 
“nasty error” case?  I suspect that nasty error is triggered by the Weblogic 
plugin based on the adjacency in the logs, but the tracing doesn’t reveal any 
details, so an strace will probably be required to get more detail.

Bugzilla was slightly easier to get log data into as I cannot use work email 
for these conversations.

Cheers,
Mark





Re: [Bug 62044] shared memory segments are not found in global list, but appear to exist in kernel.

2018-01-31 Thread Yann Ylavic
Hi Mark,

let's continue this debugging on dev@ if you don't mind..

On Wed, Jan 31, 2018 at 10:15 PM,   wrote:
> https://bz.apache.org/bugzilla/show_bug.cgi?id=62044
>
> --- Comment #32 from m...@blackmans.org ---
> so sig_coredump is being triggered by an unknown signal, multiple times a day.
> It's not a segfault, nothing in /var/log/messages. That results in a bunch of
> undeleted shared memory segments and probably some that will no longer be in
> the global list, but still present in the kernel.

In 2.4.29, i.e. without patch [1], sig_coredump might be triggered by
any signal received by httpd during a restart, and the signal handler
crashes itself (double fault), so the process is forcibly SIGKILLed
(presumably, no trace in /var/log/messages...).
This was reported and discussed in [2], and seems to correspond quite
well to what you observe in your tests.

Moreover, if the parent process crashes nothing will delete the
IPC-SysV SHMs (hence the leak in the system), while children processes
may continue to be attached which prevents a new parent process to
start (until children stop or are forcibly killed)...

When this happens, you should see non-root processes attached to PPID
1 (e.g. with "ps -ef"), "-f /path/to/httpd.conf" in the command line
might help distinguish the different httpd instances to monitor
processes.

If this is the case, you probably should try patch [1].
If not, I can't explain why in httpd logs a process with a different
PID appears after the SIGHUP, it must have been started
(automatically?) after the previous one crashed.
Here the generation number can't help, a new process always starts at
generation #0.

Regards,
Yann.

[1] 
https://svn.apache.org/repos/asf/httpd/httpd/patches/2.4.x/stop_signals-PR61558.patch
[2] https://bz.apache.org/bugzilla/show_bug.cgi?id=61558