Re: httpd stops accepting connections after a few hours on current

2015-07-16 Thread Tor Houghton
On Wed, Jul 15, 2015 at 11:40:31PM +0930, Jack Burton wrote:
> > 
> > Post it to tech@ .
> 
> Done. See post to tech@ titled "httpd: patch to close TLS sockets that
> fail before TLS handshake".
> 

Thanks; the fd's on my deployment (5.7-stable) have been stable since
application of this patch 12 hrs ago.

Tor



Re: httpd stops accepting connections after a few hours on current

2015-07-15 Thread Ax0n
Possibly related, I did have an http-only httpd exhibit similar behavior
after running for about 5 days.  httpd processes were running, but port 80
was wholly unresponsive. I attempted to find a way to grab a core from the
processes, but I couldn't figure it out. I had to get the sites back up and
running, so just did a restart on Sunday.

If there's anything I can do in preparation for it to happen again (debug,
etc) I would be happy to help in any way I can. I've been lurking on this
thread since it popped up (which was before my own httpd had problems) but
it doesn't seem like the TLS patch that you are talking about will have any
impact for my setup.

If it helps, included is my (ugly but functioning) httpd.conf:

ext_if="egress"
types { include "/usr/share/misc/mime.types" }
server "focushacks.com" {
listen on $ext_if port 80
alias "www.focushacks.com"
directory {index "index.html" }
directory {index "index.php" }
log access focushacks.com-access.log
log error focushacks.com-error.log
location "/.ht*"   { block }
location "/.git*"  { block }
location "/.svn*"  { block }
location "/*.php*" {
root { "/docs/www.focushacks.com" }
fastcgi socket "/run/php-fpm.sock"
}
location "/list*" {
root { "/docs/www.focushacks.com" }
fastcgi socket "/run/php-fpm.sock"
}
location "/news*" {
root { "/docs/www.focushacks.com" }
fastcgi socket "/run/php-fpm.sock"
}
location "/search*" {
root { "/docs/www.focushacks.com" }
fastcgi socket "/run/php-fpm.sock"
}
location "/mod*" {
root { "/docs/www.focushacks.com" }
fastcgi socket "/run/php-fpm.sock"
}
location "/*" {
root { "/docs/www.focushacks.com" }
}
}
server "kcmesh.h-i-r.net" {
listen on $ext_if port 80
alias "www.km3sh.com"
alias "km3sh.com"
directory {index "index.php" }
location "/*.php*" {
root { "/docs/kcmesh.h-i-r.net" }
fastcgi socket "/run/php-fpm.sock"
}
location "/*" {
root { "/docs/kcmesh.h-i-r.net" }
}
}
server "things.h-i-r.net" {
listen on $ext_if port 80
directory {index "index.php" }
location "/*.php*" {
root { "/docs/things.h-i-r.net" }
fastcgi socket "/run/php-fpm.sock"
}
location "/*" {
root { "/docs/things.h-i-r.net" }
}
}
server "h-i-r.net" {
listen on $ext_if port 80
directory {index "index.php" }
location "/*.php*" {
root { "/docs/h-i-r.net" }
fastcgi socket "/run/php-fpm.sock"
}
location "/*" {
root { "/docs/h-i-r.net" }
}
}


On Wed, Jul 15, 2015 at 9:10 AM, Jack Burton  wrote:

> On Wed, 2015-07-15 at 12:56 +, Mike Burns wrote:
> > On 2015-07-15 21.49.11 +0930, Jack Burton wrote:
> > > Sorry, didn't realise I couldn't post a patch to the misc@ (I've never
> > > needed to before).
> > >
> > > Please excuse my ignorance, but what is the accepted way to contribute
> a
> > > patch?
> >
> > Post it to tech@ .
>
> Done. See post to tech@ titled "httpd: patch to close TLS sockets that
> fail before TLS handshake".



Re: httpd stops accepting connections after a few hours on current

2015-07-15 Thread Jack Burton
On Wed, 2015-07-15 at 12:56 +, Mike Burns wrote: 
> On 2015-07-15 21.49.11 +0930, Jack Burton wrote:
> > Sorry, didn't realise I couldn't post a patch to the misc@ (I've never
> > needed to before).
> > 
> > Please excuse my ignorance, but what is the accepted way to contribute a
> > patch?
> 
> Post it to tech@ .

Done. See post to tech@ titled "httpd: patch to close TLS sockets that
fail before TLS handshake".



Re: httpd stops accepting connections after a few hours on current

2015-07-15 Thread ludovic coues
2015-07-15 14:56 GMT+02:00 Mike Burns :
> On 2015-07-15 21.49.11 +0930, Jack Burton wrote:
>> Sorry, didn't realise I couldn't post a patch to the misc@ (I've never
>> needed to before).
>>
>> Please excuse my ignorance, but what is the accepted way to contribute a
>> patch?
>
> Post it to tech@ .
>

Also, post patch in unified format and not as attachment but inline.

-- 

Cordialement, Coues Ludovic
+336 148 743 42



Re: httpd stops accepting connections after a few hours on current

2015-07-15 Thread Mike Burns
On 2015-07-15 21.49.11 +0930, Jack Burton wrote:
> Sorry, didn't realise I couldn't post a patch to the misc@ (I've never
> needed to before).
> 
> Please excuse my ignorance, but what is the accepted way to contribute a
> patch?

Post it to tech@ .



Re: httpd stops accepting connections after a few hours on current

2015-07-15 Thread Jack Burton
On Wed, 2015-07-15 at 21:41 +0930, Jack Burton wrote: 
> The fix is trivial -- see attached patch (against 5.7-stable -- sorry,
> I don't have any hosts running -current at present).
<...> 
> [demime 1.01d removed an attachment of type text/x-patch which had a
> name of httpd_server_accept_tls.patch"; charset="UTF-8]

Sorry, didn't realise I couldn't post a patch to the misc@ (I've never
needed to before).

Please excuse my ignorance, but what is the accepted way to contribute a
patch?



Re: httpd stops accepting connections after a few hours on current

2015-07-15 Thread Jack Burton
On Mon, 2015-07-13 at 16:19 +0200, Tor Houghton wrote: 
> On Mon, Jul 13, 2015 at 10:52:46PM +0930, Jack Burton wrote:
> > > 
> > > I don't pretend to know httpd (at all), but I'm wondering, what should
> > > fstat(1) say, over time, for the httpd processes?
> > 
> > Thanks Tor -- that was exactly the clue I needed to isolate the
> > problem.
> > 
> > [snip]
> >
> > admin talks to a custom FastCGI daemon, which is most likely the culprit
> > -- I'll debug it tomorrow.
<...> 
> 
> I am not sure you should conclude yet. I don't use FastCGI. ;-}
> 
> Now, as I write, I have 218 open fd's, compared to the 206 or whatever I had
> in my previous post. I've got a few "dangling" :443 streams (the :80 ones
> seem to disappear like they should), and then a bunch of these:

You're absolutely right -- I spoke too soon.

After double-checking that every possible path a request could take
through the custom FastCGI daemon used by admin ends by sending an
FCGI_END_REQUEST record back to httpd (it does), I turned my attention
back to the httpd logs & debug messages gathered.

This time I had my little script check the remote IP addresses of those
socket against all the httpd access logs (not just the current ones) and
where nothing matched there, finally check the httpd debug output too.

Again, only the admin server (the only one here that's Internet-facing)
had stale sockets (all open sockets for redir & portal matched log
entries) -- out of 26 open sockets, 4 matched log entries for current
HTTPS sessions, 2 matched "buffer event error" debug messages and the
other 20 didn't match in either the logs or debug messages.

I still don't know what's causing the buffer event error messages, but
as they accounted for only 2 of the 22 stale sockets, I figured it was
more important to focus on the other 20 first.

So, what sort of HTTPS event doesn't make it into the logs and doesn't
cause any debug messages containing the remote IP address to be emitted
either?

The only thing I could think of was a TCP connection to port 443 where
the remote end doesn't initiate a TLS handshake (that's nowhere near as
improbable as it sounds: think a simple port scan, or a network outage
commencing directly after the first ACK).

So, as a test I tried just that: establishing a TCP session from a
remote host then closing it without sending anything at all at layer 5.

Naturally, doing that where httpd expects plain HTTP causes only a
single debug message to be emitted ("...done"), and the socket gets
closed as expected.

But doing it where httpd expects HTTPS and the local side of the socket
remains open, nothing appears in the regular logs, and nothing
identifiable by remote IP address appears in debug messages either.

Trying to match log/debug entries that aren't identified by the remote
IP address on a host with even a modest amount of traffic struck me as
an exercise in futility, so I tried the same experiment on another host
(also running 5.7-stable) with no other load on httpd at all.

Result was the same: httpd did not close the socket or log anything in
the regular logs. However, one debug message was emitted, our old friend
"server_accept_tls: TLS accept failed - (null)"...

...which brings us right back to where this thread started.

Looking at the source, server_accept_tls() handles two types of
non-recoverable error condition: timeout after retry and outright
failure. In the first case (EV_TIMEOUT), server_accept_tls() calls
server_close() (which in turn calls server_close_http(), which closes
the socket) before returning; in the second case it does not.

I believe this is the bug we've been looking for.

The fix is trivial -- see attached patch (against 5.7-stable -- sorry,
I don't have any hosts running -current at present).

That works for me (tested here on two hosts: sparc64 with test load
only; and amd64 with modest production load).

Not sure if that's the best approach or not, but now that we've at
least established root cause, if there's a better way I'm sure someone
else on the list will point it out.

[demime 1.01d removed an attachment of type text/x-patch which had a name of 
httpd_server_accept_tls.patch"; charset="UTF-8]



Re: httpd stops accepting connections after a few hours on current

2015-07-13 Thread Edgar Pettijohn

In my very limited test I could only hang up a server on 443 not 80.



Re: httpd stops accepting connections after a few hours on current

2015-07-13 Thread Alex Greif

Hi,

maybe this problem is related to this one?
http://marc.info/?l=openbsd-misc&m=143091663725238&w=2

thanks,
Alex



Re: httpd stops accepting connections after a few hours on current

2015-07-13 Thread Tor Houghton
On Mon, Jul 13, 2015 at 10:52:46PM +0930, Jack Burton wrote:
> > 
> > I don't pretend to know httpd (at all), but I'm wondering, what should
> > fstat(1) say, over time, for the httpd processes?
> 
> Thanks Tor -- that was exactly the clue I needed to isolate the
> problem.
> 
> [snip]
>
> admin talks to a custom FastCGI daemon, which is most likely the culprit
> -- I'll debug it tomorrow.
> 
> "portal" (the other HTTPS server) also talks to a (different) custom
> FastCGI daemon, but carries orders of magnitude more traffic and didn't
> have any stale sockets -- so clearly our problem is at the other end of
> admin's FastCGI socket (not with httpd itself). Sorry for the noise.
> 
> Ted -- similarly, you may want to look into whatever is at the other end
> of your "server1"'s FastCGI socket. If your issue is the same as ours,
> that's likely where you'll find the cause.
> 

I am not sure you should conclude yet. I don't use FastCGI. ;-}

Now, as I write, I have 218 open fd's, compared to the 206 or whatever I had
in my previous post. I've got a few "dangling" :443 streams (the :80 ones
seem to disappear like they should), and then a bunch of these:

www  httpd  17244  213* internet stream tcp 0x0 *:0

While I have been writing this, the recent (since this morning) fd's have
looked like this ('$' denotes end of list):

www  httpd  17244  206* internet stream tcp 0x0 193.214.208.180:443 <-- 
193.214.208.185:57311
www  httpd  17244  207* internet stream tcp 0x0 *:0
www  httpd  17244  208* internet stream tcp 0x0 *:0
www  httpd  17244  209* internet stream tcp 0x0 *:0
www  httpd  17244  210* internet stream tcp 0x0 *:0
www  httpd  17244  211* internet stream tcp 0x0 *:0
www  httpd  17244  212* internet stream tcp 0x0 *:0
www  httpd  17244  213* internet stream tcp 0x0 *:0
www  httpd  17244  214* internet stream tcp 0x0 193.214.208.180:80 <-- 
66.249.78.231:59307
$

www  httpd  17244  206* internet stream tcp 0x0 193.214.208.180:443 <-- 
193.214.208.185:57311
www  httpd  17244  207* internet stream tcp 0x0 *:0
www  httpd  17244  208* internet stream tcp 0x0 *:0
www  httpd  17244  209* internet stream tcp 0x0 *:0
www  httpd  17244  210* internet stream tcp 0x0 *:0
www  httpd  17244  211* internet stream tcp 0x0 *:0
www  httpd  17244  212* internet stream tcp 0x0 *:0
www  httpd  17244  213* internet stream tcp 0x0 *:0
$

Notice how 214 (:80) closed and went away. 

A few minutes later, I have these:

www  httpd  17244  206* internet stream tcp 0x0 193.214.208.180:443 <-- 
193.214.208.185:57311
www  httpd  17244  207* internet stream tcp 0x0 *:0
www  httpd  17244  208* internet stream tcp 0x0 *:0
www  httpd  17244  209* internet stream tcp 0x0 *:0
www  httpd  17244  210* internet stream tcp 0x0 *:0
www  httpd  17244  211* internet stream tcp 0x0 *:0
www  httpd  17244  212* internet stream tcp 0x0 *:0
www  httpd  17244  213* internet stream tcp 0x0 *:0
www  httpd  17244  214* internet stream tcp 0x0 193.214.208.180:443 <-- 
86.129.139.178:60804
$

www  httpd  17244  206* internet stream tcp 0x0 193.214.208.180:443 <-- 
193.214.208.185:57311
www  httpd  17244  207* internet stream tcp 0x0 *:0
www  httpd  17244  208* internet stream tcp 0x0 *:0
www  httpd  17244  209* internet stream tcp 0x0 *:0
www  httpd  17244  210* internet stream tcp 0x0 *:0
www  httpd  17244  211* internet stream tcp 0x0 *:0
www  httpd  17244  212* internet stream tcp 0x0 *:0
www  httpd  17244  213* internet stream tcp 0x0 *:0
www  httpd  17244  214* internet stream tcp 0x0 *:0
www  httpd  17244  215* internet stream tcp 0x0 *:0
www  httpd  17244  216* internet stream tcp 0x0 193.214.208.180:443 <-- 
86.129.139.178:61345
$

FWIW, the following is a dump from some earlier connections from the same
client (they look too short):

07:23:48.292311 193.214.208.180.443 > 86.129.139.178.51968: S 
4293888040:4293888040(0) ack 147006770 win 16384  (DF)
  : 4500 0034 2113 4000 4006 a4f2 c1d6 d0b4  E..4!.@.@...
  0010: 5681 8bb2 01bb cb00 ffef 8828 08c3 2532  V..(..%2
  0020: 8012 4000 377d  0204 05b4 0101 0402  ..@.7}..
  0030: 0103 0303

07:23:48.345674 86.129.139.178.51968 > 193.214.208.180.443: . ack 1 win 16698 
(DF)
  : 4500 0028 5a8a 4000 7206 3987 5681 8bb2  E..(Z.@.r.9.V...
  0010: c1d6 d0b4 cb00 01bb 08c3 2532 ffef 8829  ..%2...)
  0020: 5010 413a 7711  dd2d     P.A:w-

07:23:48.346721 86.129.139.178.51968 > 193.214.208.180.443: P 1:116(115) ack 1 
win 16698 (DF)
  : 4500 009b 5a8b 4000 7206 3913 5681 8bb2  E...Z.@.r.9.V...
  0010: c1d6 d0b4 cb00 01bb 08c3 2532 ffef 8829  ..%2...)
  0020: 5018 413a 0813  1603 0100 6e01   P.A:..

Re: httpd stops accepting connections after a few hours on current

2015-07-13 Thread Jack Burton
On Mon, 2015-07-13 at 11:02 +0200, Tor Houghton wrote: 
> On Sun, Jul 12, 2015 at 07:56:37PM +0930, Jack Burton wrote:
> > 
> > It is possible I simply failed to provision sufficient capacity --
> > which could easily be fixed by adding a login class for www with a
> > higher limit on open fds -- but I fear that might just be hiding the
> > problem rather than addressing it: exhausting a 512 fd limit with with
> > peak load of only 48 req/sec (and average load of 2 req/sec) just
> > doesn't feel right (especially when that peak load is all 303s
> > generated internally by httpd, which each take only a tiny fraction of
> > a second to process).
> 
> I don't pretend to know httpd (at all), but I'm wondering, what should
> fstat(1) say, over time, for the httpd processes?

Thanks Tor -- that was exactly the clue I needed to isolate the
problem.

Wrote a short script to parse the output of running fstat -p for each
running httpd (we're running with prefork 8, so I didn't fancy doing it
by hand), and report the timestamp of the last request in the relevant
access log of each client IP with an open socket (or 'missing' if no
entry in the current access log).

Ran it roughly 4 hours after the last log rotation and found only 34
matches out of 73 open sockets. We don't run anything here that would
take anywhere near 4 hours to return a response, so the 39 that didn't
match entries in any of the current access logs were clearly where I
needed to look.

All 39 related to "admin" -- the one HTTPS server that I hadn't spent
any time looking into (since it accounts for only 0.02% of httpd's load
here, it didn't occur to me that that tiny little thing could be
bringing httpd to its knees ... famous last words).

admin talks to a custom FastCGI daemon, which is most likely the culprit
-- I'll debug it tomorrow.

"portal" (the other HTTPS server) also talks to a (different) custom
FastCGI daemon, but carries orders of magnitude more traffic and didn't
have any stale sockets -- so clearly our problem is at the other end of
admin's FastCGI socket (not with httpd itself). Sorry for the noise.

Ted -- similarly, you may want to look into whatever is at the other end
of your "server1"'s FastCGI socket. If your issue is the same as ours,
that's likely where you'll find the cause.



Re: httpd stops accepting connections after a few hours on current

2015-07-13 Thread Tor Houghton
On Sun, Jul 12, 2015 at 07:56:37PM +0930, Jack Burton wrote:
> 
> It is possible I simply failed to provision sufficient capacity --
> which could easily be fixed by adding a login class for www with a
> higher limit on open fds -- but I fear that might just be hiding the
> problem rather than addressing it: exhausting a 512 fd limit with with
> peak load of only 48 req/sec (and average load of 2 req/sec) just
> doesn't feel right (especially when that peak load is all 303s
> generated internally by httpd, which each take only a tiny fraction of
> a second to process).
> 

I don't pretend to know httpd (at all), but I'm wondering, what should
fstat(1) say, over time, for the httpd processes?

Of the (2) processes that have streams related to remote hosts, there are
several IP addresses that are never logged to the SSL access log.

For example, this one from umich.edu, which is more than 2 days old*:

$ fstat -p 29431 | grep 141.212.122.50
www  httpd  294315* internet stream tcp 0x0 193.214.208.180:443 <-- 
141.212.122.50:29801
$ fstat -p 17244 |grep 141.212.122.50
$ 

$ grep 141.212.122.50 /local/www/logs/ssl-access.log  
$ 

Is this normal behaviour?

Tor

* This one, from shadowserver.org, which does exist in the ssl-access.log, was
opened on 10 July, same day the server was rebooted:

www  httpd  29431   12* internet stream tcp 0x0 193.214.208.180:443 <-- 
184.105.247.196:35517

www.bogus.net 184.105.247.196 - - [10/Jul/2015:03:41:05 +0200] "GET / HTTP/1.1" 
200 67 "" ""



Re: httpd stops accepting connections after a few hours on current

2015-07-12 Thread Jack Burton
On Sat, 2015-07-11 at 15:38 +0930, Jack Burton wrote: 
> It hasn't happened here in a few days now so I don't have a log extract
> on hand to share (but can post one next time it happens).

Okay, the issue returned this afternoon and the httpd debug output
certainly sheds more light on the problem.

This time we didn't see either the TLS or buffer event errors anywhere
near the time at which httpd stopped responding to requests.

Instead, we're getting "server_accept: deferring connections". According
to the comments in server.c, that means we're running out of file
descriptors.

That struck me as odd, as our traffic generally isn't anywhere near high
enough to expect that, so I checked the traffic at the time and there
was indeed a spike although it didn't seem high enough to cause issues.
Peak load was 48 requests in the one second before httpd stopped
responding to requests.

All 48 of those requests were to the trivial http server, whose config
is just:

  listen on $int_addr port 80
  block return 303 "https://portal.tvir.acscomp.net";

(yes I know that that hostname doesn't resolve publicly -- but it does
when using the resolver assigned by dhcp on the semi-public [but not
Internet-facing] network on which our httpd listens)

As an aside, I didn't see in the debug output any requests during that
final second [although there were two a couple of seconds later] to the
target https server "portal" (which is served by the same instance of
httpd) -- but I guess it's possible that all 48 clients either didn't
act on the 303 or already had its target in their caches (environment
is a residential building for tertiary students, so the user base is
fairly static at this time of year -- so seems well within the realms
of possibility that all 48 had / on portal cached).

Debug output at the time httpd stopped responding reads (after 47 other
requests to the trivial http server all timestamped 16:08:54):

redir 192.168.137.160 - - [12/Jul/2015:16:08:54 +0930] "GET /personal
HTTP/1.1" 303 0
server redir, client 119933 (505 active), 192.168.137.160:40521 ->
192.168.137.1, https://portal.tvir.acscomp.net (303 See Other)
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119935 (505 active), 192.168.137.160:45643 ->
192.168.137.1, done
server redir, client 119934 (504 active), 192.168.137.160:40526 ->
192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119936 (505 active), 192.168.137.160:47925 ->
192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119938 (505 active), 192.168.137.160:40528 ->
192.168.137.1, done
server redir, client 119937 (504 active), 192.168.137.160:40527 ->
192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119940 (505 active), 192.168.137.160:37213 ->
192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
portal.tvir.acscomp.net 192.168.137.99 - - [12/Jul/2015:16:08:56 +0930]
"GET / HTTP/1.1" 200 0
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
portal.tvir.acscomp.net 192.168.137.112 - - [12/Jul/2015:16:08:57 +0930]
"GET / HTTP/1.1" 200 0
server_accept: deferring connections

Then nothing but "server_accept: deferring connections" over and over
again.

It is possible I simply failed to provision sufficient capacity --
which could easily be fixed by adding a login class for www with a
higher limit on open fds -- but I fear that might just be hiding the
problem rather than addressing it: exhausting a 512 fd limit with with
peak load of only 48 req/sec (and average load of 2 req/sec) just
doesn't feel right (especially when that peak load is all 303s
generated internally by httpd, which each take only a tiny fraction of
a second to process).

I notice in the source that server_close_http() is responsible for
freeing session-specific fds, and that it's called from server_close(),
which is also responsible for generating the "..., done" debug messages
and decrementing the active client count.

We're only seeing those "..., done" messages in the debug output for a
small proportion of completed HTTP sessions, and the active client count
continues to grow (and only falls occasionally), even when there is much
less HTTP traffic.

Is seems as if some HTTP sessions get their fds freed on completion
while others don't ... but I can't find anything in the source to
support that conjecture.

Could someone who's more familiar with httpd than I am offer a clue
please?



Re: httpd stops accepting connections after a few hours on current

2015-07-10 Thread Jack Burton
On Thu, 2015-07-09 at 11:59 +0200, Tor Houghton wrote: 
> On Wed, Jul 08, 2015 at 10:04:27PM -0500, Theodore Wynnychenko wrote:
> > 
> > [snip]
> >
> > server https://server2.tldn.com, client 2067 (63 active), 10.0.28.254:60330 
> > ->
> > 10.0.28.130:443, buffer event error
> > [..]
> > server https://server2.tldn.com, client 2068 (63 active), 10.0.28.254:52350 
> > ->
> > 10.0.28.130:443, buffer event error
> 
> I'm going to "me too" on this one (have not been until now, as I thought
> perhaps it was due to my setup, and therefore off-topic).

Likewise, seeing the same behaviour here on 5.7-stable -- so the
problem is not confined to -current.

Fairly small & simple httpd setup here, httpd configured with 3 server
stanzas: 2 HTTPS-only (both using FastCGI) plus one trivial HTTP-only
(just a block return 303 pointing to one of the HTTPS servers). Quite a
light load too (averaging 178k requests/day -- about 2/sec).

Frequency of problem varies wildly -- sometimes occurs after only an
hour or two since last httpd restart and at other times httpd will last
for up to 4 days before it stops responding to requests. Variation in
volume of requests appears to have no effect on frequency of recurrence
either.

On every occasion, httpd continues to respond correctly to signals
(httpd restarts are always clean), just not to HTTP[S] requests.

On at least one occasion, the http socket continued to respond correctly
to requests, whilst the two https ones stopped responding. On other
occasions, all 3 stopped responding at around the same time.

When a socket stops responding, it still accepts requests but httpd
neither logs (at least, when not in debug mode) nor responds to them
(i.e. I can successfully open a TCP session to the listening socket and
send it a request, but nothing comes back after the initial ACK).

It hasn't happened here in a few days now so I don't have a log extract
on hand to share (but can post one next time it happens).

>From memory in the past we were seeing TLS accept fail errors in the
logs, as reported by the original poster, but not at the time the
sockets stopped responding (only well beforehand), so I'd also assumed
that those were unrelated. Running tcpdump on both user-facing
interfaces (and on pflog0 just to rule out the possibility of some
error in our pf.conf) whilst httpd was not responding to requests on
previous occasions revealed nothing new.

Have tried watching debug output a couple of times before, but it
rapidly gets quite unwieldy, even with our modest load (especially over
a remote ssh session -- both uplinks at that site are nearing
capacity), given the length of time it can take for the problem to
manifest (on each occasion I gave up after a few hours without the
problem occurring).

Am now running httpd -dvvv with stdout/err redirected to a temporary log
file (probably should have done that in the first place).

We are already seeing (after less than a minute) entries in the debug
logs similar to those reported by Theodore, for example:

* On an HTTPS server (using FastCGI):
server portal, client 305 (14 active), 192.168.137.161:52224 ->
192.168.137.1:443, buffer event error

and

* On the trivial HTTP server (using just a block return 303):
server redir, client 132 (11 active), 192.168.137.100:61081 ->
192.168.137.1, buffer event timeout

However, the original problem (httpd stops responding to requests) is
*not* occurring at present.

Will post debug log extract & httpd.conf next time the problem recurs
(should be within the next few days).



Re: httpd stops accepting connections after a few hours on current

2015-07-10 Thread Theodore Wynnychenko
-Original Message-
From: owner-m...@openbsd.org [mailto:owner-m...@openbsd.org] On Behalf Of Edgar
Pettijohn
Sent: Friday, July 10, 2015 7:32 PM
To: misc@openbsd.org
Subject: Re: httpd stops accepting connections after a few hours on current

On 07/08/15 22:04, Theodore Wynnychenko wrote:
> From: owner-m...@openbsd.org [mailto:owner-m...@openbsd.org] On Behalf Of
Edgar
> Pettijohn
> Sent: Wednesday, July 08, 2015 8:52 PM
> To: misc@openbsd.org
> Subject: Re: httpd stops accepting connections after a few hours on current
>
> On 07/08/15 12:15, Henrik Friedrichsen wrote:
>> I have encountered the same problem as has the guy in [1]. Maybe it is
>> worth a bug report?
>>
>> Could this be related to FastCGI and exhaustion of some ressource?
>>
>> [1]: https://www.facebook.com/groups/2210554563/permalink/10153383131319564/
>>
> Have you run in debug mode?
>
> # httpd -dvvv
>
> --
>
> Hello
> Earlier today, I started it in debug mode.
> What I noticed is that for the first few hours there is nothing much of
> interest.
> A bunch of:
>
> http://server3.tldn.com 10.0.28.254 - - [08/Jul/2015:14:38:31 -0500] "POST
> /rpc/ClientApi HTTP/1.1" 301 0
> server http://server3.tldn.com, client 1552 (36 active), 10.0.28.254:65357 ->
> 10.0.28.131, https://server3.tldn.com/index.html (301 Moved Permanently)
>
> But, then I noticed (and I actually did notice this in the logs the other day,
> but was too stupid to understand that it was - I think now - important):
>
> server https://server2.tldn.com, client 2067 (63 active), 10.0.28.254:60330 ->
> 10.0.28.130:443, buffer event error
> server https://server2.tldn.com, client 1756 (72 active), 10.0.28.254:56005 ->
> 10.0.28.130:443, buffer event error
> server https://server2.tldn.com, client 1521 (58 active), 10.0.28.254:56066 ->
> 10.0.28.130:443, buffer event error
> server https://server2.tldn.com, client 2068 (63 active), 10.0.28.254:52350 ->
> 10.0.28.130:443, buffer event error

The httpd.conf you supplied doesn't show this server serving fastcgi is 
that true?  I'm testing a similiar config to see if I can duplicate the 
error.
>
> It appears that the "buffer event error"s aren't there for a few hours, then
> start, then increase, and eventually there is no more logging, and httpd is
not
> responsive.
>

Correct.  Only "server1" utilizes fastcgi.
"server2" and "server3" only have static content.



Re: httpd stops accepting connections after a few hours on current

2015-07-10 Thread Edgar Pettijohn

On 07/10/15 19:32, Edgar Pettijohn wrote:

On 07/08/15 22:04, Theodore Wynnychenko wrote:
From: owner-m...@openbsd.org [mailto:owner-m...@openbsd.org] On 
Behalf Of Edgar

Pettijohn
Sent: Wednesday, July 08, 2015 8:52 PM
To: misc@openbsd.org
Subject: Re: httpd stops accepting connections after a few hours on 
current


On 07/08/15 12:15, Henrik Friedrichsen wrote:

I have encountered the same problem as has the guy in [1]. Maybe it is
worth a bug report?

Could this be related to FastCGI and exhaustion of some ressource?

[1]: 
https://www.facebook.com/groups/2210554563/permalink/10153383131319564/



Have you run in debug mode?

# httpd -dvvv

--

Hello
Earlier today, I started it in debug mode.
What I noticed is that for the first few hours there is nothing much of
interest.
A bunch of:

http://server3.tldn.com 10.0.28.254 - - [08/Jul/2015:14:38:31 -0500] 
"POST

/rpc/ClientApi HTTP/1.1" 301 0
server http://server3.tldn.com, client 1552 (36 active), 
10.0.28.254:65357 ->

10.0.28.131, https://server3.tldn.com/index.html (301 Moved Permanently)

But, then I noticed (and I actually did notice this in the logs the 
other day,

but was too stupid to understand that it was - I think now - important):

server https://server2.tldn.com, client 2067 (63 active), 
10.0.28.254:60330 ->

10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 1756 (72 active), 
10.0.28.254:56005 ->

10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 1521 (58 active), 
10.0.28.254:56066 ->

10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 2068 (63 active), 
10.0.28.254:52350 ->

10.0.28.130:443, buffer event error


The httpd.conf you supplied doesn't show this server serving fastcgi 
is that true?  I'm testing a similiar config to see if I can duplicate 
the error.


It appears that the "buffer event error"s aren't there for a few 
hours, then
start, then increase, and eventually there is no more logging, and 
httpd is not

responsive.

[demime 1.01d removed an attachment of type 
application/x-pkcs7-signature which had a name of smime.p7s]


This may not have been the most scientific test, but I was able to 
reproduce as follows:


httpd.conf
server "default" {
listen on $ext_addr port 80
directory auto index
}

server "secure" {
listen on 192.168.1.5 tls port 443
tls certificate "/etc/ssl/web.crt"
tls key "/etc/ssl/private/web.key"
directory auto index
}

test.sh
#!/bin/sh

while true
do
wget --no-check-certificate https://192.168.1.5/sasl/index.html 
&& rm index.html

done

had to add --no-check-certificate due to:

ERROR: cannot verify 192.168.1.5's certificate, issued by 
'CN=192.168.1.5,L=Austin,ST=Texas,C=US':

  Self-signed certificate encountered.

I tried a couple of certificates all self signed with the same results.

httpd -dvvv
secure 192.168.1.5 - - [10/Jul/2015:20:52:13 -0500] "GET 
/sasl/index.html HTTP/1.1" 200 5491
server secure, client 3101 (1 active), 192.168.1.5:21322 -> 
192.168.1.5:443, buffer event error
secure 192.168.1.5 - - [10/Jul/2015:20:52:13 -0500] "GET 
/sasl/index.html HTTP/1.1" 200 5491
server secure, client 3002 (3 active), 192.168.1.5:23332 -> 
192.168.1.5:443, buffer event error
secure 192.168.1.5 - - [10/Jul/2015:20:52:13 -0500] "GET 
/sasl/index.html HTTP/1.1" 200 5491
server secure, client 3102 (1 active), 192.168.1.5:23204 -> 
192.168.1.5:443, buffer event error

(after killing test.sh)
server_accept_tls: TLS accept failed - accept failed: Connection reset 
by peer




Re: httpd stops accepting connections after a few hours on current

2015-07-10 Thread Edgar Pettijohn

On 07/08/15 22:04, Theodore Wynnychenko wrote:

From: owner-m...@openbsd.org [mailto:owner-m...@openbsd.org] On Behalf Of Edgar
Pettijohn
Sent: Wednesday, July 08, 2015 8:52 PM
To: misc@openbsd.org
Subject: Re: httpd stops accepting connections after a few hours on current

On 07/08/15 12:15, Henrik Friedrichsen wrote:

I have encountered the same problem as has the guy in [1]. Maybe it is
worth a bug report?

Could this be related to FastCGI and exhaustion of some ressource?

[1]: https://www.facebook.com/groups/2210554563/permalink/10153383131319564/


Have you run in debug mode?

# httpd -dvvv

--

Hello
Earlier today, I started it in debug mode.
What I noticed is that for the first few hours there is nothing much of
interest.
A bunch of:

http://server3.tldn.com 10.0.28.254 - - [08/Jul/2015:14:38:31 -0500] "POST
/rpc/ClientApi HTTP/1.1" 301 0
server http://server3.tldn.com, client 1552 (36 active), 10.0.28.254:65357 ->
10.0.28.131, https://server3.tldn.com/index.html (301 Moved Permanently)

But, then I noticed (and I actually did notice this in the logs the other day,
but was too stupid to understand that it was - I think now - important):

server https://server2.tldn.com, client 2067 (63 active), 10.0.28.254:60330 ->
10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 1756 (72 active), 10.0.28.254:56005 ->
10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 1521 (58 active), 10.0.28.254:56066 ->
10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 2068 (63 active), 10.0.28.254:52350 ->
10.0.28.130:443, buffer event error


The httpd.conf you supplied doesn't show this server serving fastcgi is 
that true?  I'm testing a similiar config to see if I can duplicate the 
error.


It appears that the "buffer event error"s aren't there for a few hours, then
start, then increase, and eventually there is no more logging, and httpd is not
responsive.

[demime 1.01d removed an attachment of type application/x-pkcs7-signature which 
had a name of smime.p7s]




Re: httpd stops accepting connections after a few hours on current

2015-07-09 Thread Matthew Martin
On 7/9/15, Tor Houghton  wrote:
> On Wed, Jul 08, 2015 at 10:04:27PM -0500, Theodore Wynnychenko wrote:
>>
>> [snip]
>>
>> server https://server2.tldn.com, client 2067 (63 active),
>> 10.0.28.254:60330 ->
>> 10.0.28.130:443, buffer event error
>> [..]
>> server https://server2.tldn.com, client 2068 (63 active),
>> 10.0.28.254:52350 ->
>> 10.0.28.130:443, buffer event error
>>
>
> I'm going to "me too" on this one (have not been until now, as I thought
> perhaps it was due to my setup, and therefore off-topic).
>
> I run several OpenBSD guests on VirtualBox, and have been for years without
> any problem (for my use). Since moving to httpd (from apache), I've also
> noticed the buffer event errors. In my case, it's not just httpd that
> hangs,
> but the guest itself, with its corresponding VirtualBox process on the host
> pegging the cpu.
>
> I've also seen
>
> httpd[7197]: server_accept_tls: TLS accept failed - (null)
>
> But this appears to be related to clients not completing / attempting
> proper
> TLS connections (tcpdumps verify this).
>
> I shall endeavour to effort into obtaining som guest debugging information
> if anyone is interested.
>
> Tor
>
>

Possibly related? http://marc.info/?l=openbsd-bugs&m=143370643301193&w=4
After the patch in that thread to fix large file transfers I was
seeing TLS issues pretty quickly. Wound up hopping back to Apache
until I could take another look on current which just hasn't happened
yet. 'buffer event error' and 'TLS accept failed' are definitely
ringing some bells here.



Re: httpd stops accepting connections after a few hours on current

2015-07-09 Thread Tor Houghton
On Wed, Jul 08, 2015 at 10:04:27PM -0500, Theodore Wynnychenko wrote:
> 
> [snip]
>
> server https://server2.tldn.com, client 2067 (63 active), 10.0.28.254:60330 ->
> 10.0.28.130:443, buffer event error
> [..]
> server https://server2.tldn.com, client 2068 (63 active), 10.0.28.254:52350 ->
> 10.0.28.130:443, buffer event error
> 

I'm going to "me too" on this one (have not been until now, as I thought
perhaps it was due to my setup, and therefore off-topic).

I run several OpenBSD guests on VirtualBox, and have been for years without
any problem (for my use). Since moving to httpd (from apache), I've also 
noticed the buffer event errors. In my case, it's not just httpd that hangs,
but the guest itself, with its corresponding VirtualBox process on the host
pegging the cpu.

I've also seen

httpd[7197]: server_accept_tls: TLS accept failed - (null)

But this appears to be related to clients not completing / attempting proper
TLS connections (tcpdumps verify this). 

I shall endeavour to effort into obtaining som guest debugging information
if anyone is interested.

Tor



Re: httpd stops accepting connections after a few hours on current

2015-07-08 Thread Theodore Wynnychenko
From: owner-m...@openbsd.org [mailto:owner-m...@openbsd.org] On Behalf Of Edgar
Pettijohn
Sent: Wednesday, July 08, 2015 8:52 PM
To: misc@openbsd.org
Subject: Re: httpd stops accepting connections after a few hours on current

On 07/08/15 12:15, Henrik Friedrichsen wrote:
> I have encountered the same problem as has the guy in [1]. Maybe it is
> worth a bug report?
>
> Could this be related to FastCGI and exhaustion of some ressource?
>
> [1]: https://www.facebook.com/groups/2210554563/permalink/10153383131319564/
>
Have you run in debug mode?

# httpd -dvvv

--

Hello
Earlier today, I started it in debug mode.
What I noticed is that for the first few hours there is nothing much of
interest.
A bunch of:

http://server3.tldn.com 10.0.28.254 - - [08/Jul/2015:14:38:31 -0500] "POST
/rpc/ClientApi HTTP/1.1" 301 0
server http://server3.tldn.com, client 1552 (36 active), 10.0.28.254:65357 ->
10.0.28.131, https://server3.tldn.com/index.html (301 Moved Permanently)

But, then I noticed (and I actually did notice this in the logs the other day,
but was too stupid to understand that it was - I think now - important):

server https://server2.tldn.com, client 2067 (63 active), 10.0.28.254:60330 ->
10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 1756 (72 active), 10.0.28.254:56005 ->
10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 1521 (58 active), 10.0.28.254:56066 ->
10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 2068 (63 active), 10.0.28.254:52350 ->
10.0.28.130:443, buffer event error

It appears that the "buffer event error"s aren't there for a few hours, then
start, then increase, and eventually there is no more logging, and httpd is not
responsive.

[demime 1.01d removed an attachment of type application/x-pkcs7-signature which 
had a name of smime.p7s]



Re: httpd stops accepting connections after a few hours on current

2015-07-08 Thread Edgar Pettijohn

On 07/08/15 12:15, Henrik Friedrichsen wrote:

I have encountered the same problem as has the guy in [1]. Maybe it is
worth a bug report?

Could this be related to FastCGI and exhaustion of some ressource?

[1]: https://www.facebook.com/groups/2210554563/permalink/10153383131319564/


Have you run in debug mode?

# httpd -dvvv



Re: httpd stops accepting connections after a few hours on current

2015-07-08 Thread Henrik Friedrichsen
I have encountered the same problem as has the guy in [1]. Maybe it is
worth a bug report?

Could this be related to FastCGI and exhaustion of some ressource?

[1]: https://www.facebook.com/groups/2210554563/permalink/10153383131319564/



Re: httpd stops accepting connections after a few hours on current

2015-07-08 Thread Ted Wynnychenko
Hello again:

I was hoping someone might have an idea of how I could proceed with the httpd
failure that I am seeing.

As I said, I see nothing in the logs, and the only indication I have that
something is wrong is that httpd stops responding to requests.

At this point, I have added a cronjob of "rcctl restart httpd" to run every
couple of hours; but, it seems to me that my "solution" is not something that is
consistent with the quality/stability of OpenBSD.

If anyone has any suggestions on what I can do, or what information might be
helpful in identifying the issue, I would appreciate it.

Thanks

-Original Message-

Hello

On current:
OpenBSD 5.8-beta (GENERIC.MP) #1125: Fri Jul  3 20:54:45 MDT 2015
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP

System using 2 aliases on one interface:
ifconfig em0
em0: flags=8843 mtu 1500
lladdr 00:25:90:de:f8:67
priority: 0
groups: egress
media: Ethernet autoselect (1000baseT full-duplex,rxpause,txpause)
status: active
inet 10.0.28.129 netmask 0xff00 broadcast 10.0.28.255
inet 10.0.28.130 netmask 0x
inet 10.0.28.131 netmask 0x


HTTPD listening on all three IP's with different content:
cat /etc/httpd.conf

http_ip="10.0.28.129"

types {
include "/usr/share/misc/mime.types"
}

server "server1.tldn.com" {
listen on $http_ip port 80
connection max requests 100
root "/htdocs"
directory index index.html
block return 301 "https://server1.tldn.com$REQUEST_URI";
}

server "server1.tldn.com" {
listen on $http_ip tls port 443
tls certificate "/etc/ssl/server-129.crt"
tls key "/etc/ssl/private/server-129.key"
connection max requests 100
root "/htdocs"
directory index index.html
location "/cgi-bin/*" {
fastcgi
root "/"
}
}

server "server2.tldn.com" {
listen on 10.0.28.130 port 80
connection max requests 100
root "/htdocs-130"
directory index index.html
block return 301 "https://server2.tldn.com/default.html";
}

server "server2.tldn.com" {
listen on 10.0.28.130 tls port 443
tls certificate "/etc/ssl/server-130.crt"
tls key "/etc/ssl/private/server-130.key"
connection max requests 100
root "/htdocs-130"
directory index index.html
}

server "server3.tldn.com" {
listen on 10.0.28.131 port 80
connection max requests 100
root "/htdocs-131"
directory index index.html
block return 301 "https://server3.tldn.com/index.html";
}

server "server3.tldn.com" {
listen on 10.0.28.131 tls port 443
tls certificate "/etc/ssl/server-131.crt"
tls key "/etc/ssl/private/server-131.key"
connection max requests 100
root "/htdocs-131"
directory index index.html
}


The httpd daemon starts up without an issue and runs without a problem for a few
hours (I have not actually timed it, but it goes for about 6-8 hours), and then
just stops responding.
This is true on all three of the ip's that it is listening on; and whether or
not the original connection is to port 80 or 443.

The access log show all connections until (I presume) the daemon stops answering
connections.

I have tried combining all the ssl certs into one (with a lot of alternative
names), but that did not make a difference.

A simple "rcctl restart httpd" brings the server back up for a few hours.

I would be happy to investigate this more, but I don't know where to go.
Similarly, I don't know what other information I could provide to help diagnose
the problem.

Any help would be great.

Thanks - ted

[demime 1.01d removed an attachment of type application/x-pkcs7-signature which 
had a name of smime.p7s]



httpd stops accepting connections after a few hours on current

2015-07-05 Thread Theodore Wynnychenko
Hello

On current:
OpenBSD 5.8-beta (GENERIC.MP) #1125: Fri Jul  3 20:54:45 MDT 2015
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP

System using 2 aliases on one interface:
ifconfig em0
em0: flags=8843 mtu 1500
lladdr 00:25:90:de:f8:67
priority: 0
groups: egress
media: Ethernet autoselect (1000baseT full-duplex,rxpause,txpause)
status: active
inet 10.0.28.129 netmask 0xff00 broadcast 10.0.28.255
inet 10.0.28.130 netmask 0x
inet 10.0.28.131 netmask 0x


HTTPD listening on all three IP's with different content:
cat /etc/httpd.conf

http_ip="10.0.28.129"

types {
include "/usr/share/misc/mime.types"
}

server "server1.tldn.com" {
listen on $http_ip port 80
connection max requests 100
root "/htdocs"
directory index index.html
block return 301 "https://server1.tldn.com$REQUEST_URI";
}

server "server1.tldn.com" {
listen on $http_ip tls port 443
tls certificate "/etc/ssl/server-129.crt"
tls key "/etc/ssl/private/server-129.key"
connection max requests 100
root "/htdocs"
directory index index.html
location "/cgi-bin/*" {
fastcgi
root "/"
}
}

server "server2.tldn.com" {
listen on 10.0.28.130 port 80
connection max requests 100
root "/htdocs-130"
directory index index.html
block return 301 "https://server2.tldn.com/default.html";
}

server "server2.tldn.com" {
listen on 10.0.28.130 tls port 443
tls certificate "/etc/ssl/server-130.crt"
tls key "/etc/ssl/private/server-130.key"
connection max requests 100
root "/htdocs-130"
directory index index.html
}

server "server3.tldn.com" {
listen on 10.0.28.131 port 80
connection max requests 100
root "/htdocs-131"
directory index index.html
block return 301 "https://server3.tldn.com/index.html";
}

server "server3.tldn.com" {
listen on 10.0.28.131 tls port 443
tls certificate "/etc/ssl/server-131.crt"
tls key "/etc/ssl/private/server-131.key"
connection max requests 100
root "/htdocs-131"
directory index index.html
}


The httpd daemon starts up without an issue and runs without a problem for a few
hours (I have not actually timed it, but it goes for about 6-8 hours), and then
just stops responding.
This is true on all three of the ip's that it is listening on; and whether or
not the original connection is to port 80 or 443.

The access log show all connections until (I presume) the daemon stops answering
connections.

I have tried combining all the ssl certs into one (with a lot of alternative
names), but that did not make a difference.

A simple "rcctl restart httpd" brings the server back up for a few hours.

I would be happy to investigate this more, but I don't know where to go.
Similarly, I don't know what other information I could provide to help diagnose
the problem.

Any help would be great.

Thanks - ted