Re: httpd stops accepting connections after a few hours on current

2015-07-16 Thread Tor Houghton
On Wed, Jul 15, 2015 at 11:40:31PM +0930, Jack Burton wrote:
  
  Post it to tech@ .
 
 Done. See post to tech@ titled httpd: patch to close TLS sockets that
 fail before TLS handshake.
 

Thanks; the fd's on my deployment (5.7-stable) have been stable since
application of this patch 12 hrs ago.

Tor



Re: httpd stops accepting connections after a few hours on current

2015-07-15 Thread Ax0n
Possibly related, I did have an http-only httpd exhibit similar behavior
after running for about 5 days.  httpd processes were running, but port 80
was wholly unresponsive. I attempted to find a way to grab a core from the
processes, but I couldn't figure it out. I had to get the sites back up and
running, so just did a restart on Sunday.

If there's anything I can do in preparation for it to happen again (debug,
etc) I would be happy to help in any way I can. I've been lurking on this
thread since it popped up (which was before my own httpd had problems) but
it doesn't seem like the TLS patch that you are talking about will have any
impact for my setup.

If it helps, included is my (ugly but functioning) httpd.conf:

ext_if=egress
types { include /usr/share/misc/mime.types }
server focushacks.com {
listen on $ext_if port 80
alias www.focushacks.com
directory {index index.html }
directory {index index.php }
log access focushacks.com-access.log
log error focushacks.com-error.log
location /.ht*   { block }
location /.git*  { block }
location /.svn*  { block }
location /*.php* {
root { /docs/www.focushacks.com }
fastcgi socket /run/php-fpm.sock
}
location /list* {
root { /docs/www.focushacks.com }
fastcgi socket /run/php-fpm.sock
}
location /news* {
root { /docs/www.focushacks.com }
fastcgi socket /run/php-fpm.sock
}
location /search* {
root { /docs/www.focushacks.com }
fastcgi socket /run/php-fpm.sock
}
location /mod* {
root { /docs/www.focushacks.com }
fastcgi socket /run/php-fpm.sock
}
location /* {
root { /docs/www.focushacks.com }
}
}
server kcmesh.h-i-r.net {
listen on $ext_if port 80
alias www.km3sh.com
alias km3sh.com
directory {index index.php }
location /*.php* {
root { /docs/kcmesh.h-i-r.net }
fastcgi socket /run/php-fpm.sock
}
location /* {
root { /docs/kcmesh.h-i-r.net }
}
}
server things.h-i-r.net {
listen on $ext_if port 80
directory {index index.php }
location /*.php* {
root { /docs/things.h-i-r.net }
fastcgi socket /run/php-fpm.sock
}
location /* {
root { /docs/things.h-i-r.net }
}
}
server h-i-r.net {
listen on $ext_if port 80
directory {index index.php }
location /*.php* {
root { /docs/h-i-r.net }
fastcgi socket /run/php-fpm.sock
}
location /* {
root { /docs/h-i-r.net }
}
}


On Wed, Jul 15, 2015 at 9:10 AM, Jack Burton j...@saosce.com.au wrote:

 On Wed, 2015-07-15 at 12:56 +, Mike Burns wrote:
  On 2015-07-15 21.49.11 +0930, Jack Burton wrote:
   Sorry, didn't realise I couldn't post a patch to the misc@ (I've never
   needed to before).
  
   Please excuse my ignorance, but what is the accepted way to contribute
 a
   patch?
 
  Post it to tech@ .

 Done. See post to tech@ titled httpd: patch to close TLS sockets that
 fail before TLS handshake.



Re: httpd stops accepting connections after a few hours on current

2015-07-15 Thread Jack Burton
On Wed, 2015-07-15 at 21:41 +0930, Jack Burton wrote: 
 The fix is trivial -- see attached patch (against 5.7-stable -- sorry,
 I don't have any hosts running -current at present).
... 
 [demime 1.01d removed an attachment of type text/x-patch which had a
 name of httpd_server_accept_tls.patch; charset=UTF-8]

Sorry, didn't realise I couldn't post a patch to the misc@ (I've never
needed to before).

Please excuse my ignorance, but what is the accepted way to contribute a
patch?



Re: httpd stops accepting connections after a few hours on current

2015-07-15 Thread Jack Burton
On Mon, 2015-07-13 at 16:19 +0200, Tor Houghton wrote: 
 On Mon, Jul 13, 2015 at 10:52:46PM +0930, Jack Burton wrote:
   
   I don't pretend to know httpd (at all), but I'm wondering, what should
   fstat(1) say, over time, for the httpd processes?
  
  Thanks Tor -- that was exactly the clue I needed to isolate the
  problem.
  
  [snip]
 
  admin talks to a custom FastCGI daemon, which is most likely the culprit
  -- I'll debug it tomorrow.
... 
 
 I am not sure you should conclude yet. I don't use FastCGI. ;-}
 
 Now, as I write, I have 218 open fd's, compared to the 206 or whatever I had
 in my previous post. I've got a few dangling :443 streams (the :80 ones
 seem to disappear like they should), and then a bunch of these:

You're absolutely right -- I spoke too soon.

After double-checking that every possible path a request could take
through the custom FastCGI daemon used by admin ends by sending an
FCGI_END_REQUEST record back to httpd (it does), I turned my attention
back to the httpd logs  debug messages gathered.

This time I had my little script check the remote IP addresses of those
socket against all the httpd access logs (not just the current ones) and
where nothing matched there, finally check the httpd debug output too.

Again, only the admin server (the only one here that's Internet-facing)
had stale sockets (all open sockets for redir  portal matched log
entries) -- out of 26 open sockets, 4 matched log entries for current
HTTPS sessions, 2 matched buffer event error debug messages and the
other 20 didn't match in either the logs or debug messages.

I still don't know what's causing the buffer event error messages, but
as they accounted for only 2 of the 22 stale sockets, I figured it was
more important to focus on the other 20 first.

So, what sort of HTTPS event doesn't make it into the logs and doesn't
cause any debug messages containing the remote IP address to be emitted
either?

The only thing I could think of was a TCP connection to port 443 where
the remote end doesn't initiate a TLS handshake (that's nowhere near as
improbable as it sounds: think a simple port scan, or a network outage
commencing directly after the first ACK).

So, as a test I tried just that: establishing a TCP session from a
remote host then closing it without sending anything at all at layer 5.

Naturally, doing that where httpd expects plain HTTP causes only a
single debug message to be emitted (...done), and the socket gets
closed as expected.

But doing it where httpd expects HTTPS and the local side of the socket
remains open, nothing appears in the regular logs, and nothing
identifiable by remote IP address appears in debug messages either.

Trying to match log/debug entries that aren't identified by the remote
IP address on a host with even a modest amount of traffic struck me as
an exercise in futility, so I tried the same experiment on another host
(also running 5.7-stable) with no other load on httpd at all.

Result was the same: httpd did not close the socket or log anything in
the regular logs. However, one debug message was emitted, our old friend
server_accept_tls: TLS accept failed - (null)...

...which brings us right back to where this thread started.

Looking at the source, server_accept_tls() handles two types of
non-recoverable error condition: timeout after retry and outright
failure. In the first case (EV_TIMEOUT), server_accept_tls() calls
server_close() (which in turn calls server_close_http(), which closes
the socket) before returning; in the second case it does not.

I believe this is the bug we've been looking for.

The fix is trivial -- see attached patch (against 5.7-stable -- sorry,
I don't have any hosts running -current at present).

That works for me (tested here on two hosts: sparc64 with test load
only; and amd64 with modest production load).

Not sure if that's the best approach or not, but now that we've at
least established root cause, if there's a better way I'm sure someone
else on the list will point it out.

[demime 1.01d removed an attachment of type text/x-patch which had a name of 
httpd_server_accept_tls.patch; charset=UTF-8]



Re: httpd stops accepting connections after a few hours on current

2015-07-15 Thread Jack Burton
On Wed, 2015-07-15 at 12:56 +, Mike Burns wrote: 
 On 2015-07-15 21.49.11 +0930, Jack Burton wrote:
  Sorry, didn't realise I couldn't post a patch to the misc@ (I've never
  needed to before).
  
  Please excuse my ignorance, but what is the accepted way to contribute a
  patch?
 
 Post it to tech@ .

Done. See post to tech@ titled httpd: patch to close TLS sockets that
fail before TLS handshake.



Re: httpd stops accepting connections after a few hours on current

2015-07-15 Thread Mike Burns
On 2015-07-15 21.49.11 +0930, Jack Burton wrote:
 Sorry, didn't realise I couldn't post a patch to the misc@ (I've never
 needed to before).
 
 Please excuse my ignorance, but what is the accepted way to contribute a
 patch?

Post it to tech@ .



Re: httpd stops accepting connections after a few hours on current

2015-07-15 Thread ludovic coues
2015-07-15 14:56 GMT+02:00 Mike Burns mike+open...@mike-burns.com:
 On 2015-07-15 21.49.11 +0930, Jack Burton wrote:
 Sorry, didn't realise I couldn't post a patch to the misc@ (I've never
 needed to before).

 Please excuse my ignorance, but what is the accepted way to contribute a
 patch?

 Post it to tech@ .


Also, post patch in unified format and not as attachment but inline.

-- 

Cordialement, Coues Ludovic
+336 148 743 42



Re: httpd stops accepting connections after a few hours on current

2015-07-13 Thread Tor Houghton
On Sun, Jul 12, 2015 at 07:56:37PM +0930, Jack Burton wrote:
 
 It is possible I simply failed to provision sufficient capacity --
 which could easily be fixed by adding a login class for www with a
 higher limit on open fds -- but I fear that might just be hiding the
 problem rather than addressing it: exhausting a 512 fd limit with with
 peak load of only 48 req/sec (and average load of 2 req/sec) just
 doesn't feel right (especially when that peak load is all 303s
 generated internally by httpd, which each take only a tiny fraction of
 a second to process).
 

I don't pretend to know httpd (at all), but I'm wondering, what should
fstat(1) say, over time, for the httpd processes?

Of the (2) processes that have streams related to remote hosts, there are
several IP addresses that are never logged to the SSL access log.

For example, this one from umich.edu, which is more than 2 days old*:

$ fstat -p 29431 | grep 141.212.122.50
www  httpd  294315* internet stream tcp 0x0 193.214.208.180:443 -- 
141.212.122.50:29801
$ fstat -p 17244 |grep 141.212.122.50
$ 

$ grep 141.212.122.50 /local/www/logs/ssl-access.log  
$ 

Is this normal behaviour?

Tor

* This one, from shadowserver.org, which does exist in the ssl-access.log, was
opened on 10 July, same day the server was rebooted:

www  httpd  29431   12* internet stream tcp 0x0 193.214.208.180:443 -- 
184.105.247.196:35517

www.bogus.net 184.105.247.196 - - [10/Jul/2015:03:41:05 +0200] GET / HTTP/1.1 
200 67  



Re: httpd stops accepting connections after a few hours on current

2015-07-13 Thread Tor Houghton
On Mon, Jul 13, 2015 at 10:52:46PM +0930, Jack Burton wrote:
  
  I don't pretend to know httpd (at all), but I'm wondering, what should
  fstat(1) say, over time, for the httpd processes?
 
 Thanks Tor -- that was exactly the clue I needed to isolate the
 problem.
 
 [snip]

 admin talks to a custom FastCGI daemon, which is most likely the culprit
 -- I'll debug it tomorrow.
 
 portal (the other HTTPS server) also talks to a (different) custom
 FastCGI daemon, but carries orders of magnitude more traffic and didn't
 have any stale sockets -- so clearly our problem is at the other end of
 admin's FastCGI socket (not with httpd itself). Sorry for the noise.
 
 Ted -- similarly, you may want to look into whatever is at the other end
 of your server1's FastCGI socket. If your issue is the same as ours,
 that's likely where you'll find the cause.
 

I am not sure you should conclude yet. I don't use FastCGI. ;-}

Now, as I write, I have 218 open fd's, compared to the 206 or whatever I had
in my previous post. I've got a few dangling :443 streams (the :80 ones
seem to disappear like they should), and then a bunch of these:

www  httpd  17244  213* internet stream tcp 0x0 *:0

While I have been writing this, the recent (since this morning) fd's have
looked like this ('$' denotes end of list):

www  httpd  17244  206* internet stream tcp 0x0 193.214.208.180:443 -- 
193.214.208.185:57311
www  httpd  17244  207* internet stream tcp 0x0 *:0
www  httpd  17244  208* internet stream tcp 0x0 *:0
www  httpd  17244  209* internet stream tcp 0x0 *:0
www  httpd  17244  210* internet stream tcp 0x0 *:0
www  httpd  17244  211* internet stream tcp 0x0 *:0
www  httpd  17244  212* internet stream tcp 0x0 *:0
www  httpd  17244  213* internet stream tcp 0x0 *:0
www  httpd  17244  214* internet stream tcp 0x0 193.214.208.180:80 -- 
66.249.78.231:59307
$

www  httpd  17244  206* internet stream tcp 0x0 193.214.208.180:443 -- 
193.214.208.185:57311
www  httpd  17244  207* internet stream tcp 0x0 *:0
www  httpd  17244  208* internet stream tcp 0x0 *:0
www  httpd  17244  209* internet stream tcp 0x0 *:0
www  httpd  17244  210* internet stream tcp 0x0 *:0
www  httpd  17244  211* internet stream tcp 0x0 *:0
www  httpd  17244  212* internet stream tcp 0x0 *:0
www  httpd  17244  213* internet stream tcp 0x0 *:0
$

Notice how 214 (:80) closed and went away. 

A few minutes later, I have these:

www  httpd  17244  206* internet stream tcp 0x0 193.214.208.180:443 -- 
193.214.208.185:57311
www  httpd  17244  207* internet stream tcp 0x0 *:0
www  httpd  17244  208* internet stream tcp 0x0 *:0
www  httpd  17244  209* internet stream tcp 0x0 *:0
www  httpd  17244  210* internet stream tcp 0x0 *:0
www  httpd  17244  211* internet stream tcp 0x0 *:0
www  httpd  17244  212* internet stream tcp 0x0 *:0
www  httpd  17244  213* internet stream tcp 0x0 *:0
www  httpd  17244  214* internet stream tcp 0x0 193.214.208.180:443 -- 
86.129.139.178:60804
$

www  httpd  17244  206* internet stream tcp 0x0 193.214.208.180:443 -- 
193.214.208.185:57311
www  httpd  17244  207* internet stream tcp 0x0 *:0
www  httpd  17244  208* internet stream tcp 0x0 *:0
www  httpd  17244  209* internet stream tcp 0x0 *:0
www  httpd  17244  210* internet stream tcp 0x0 *:0
www  httpd  17244  211* internet stream tcp 0x0 *:0
www  httpd  17244  212* internet stream tcp 0x0 *:0
www  httpd  17244  213* internet stream tcp 0x0 *:0
www  httpd  17244  214* internet stream tcp 0x0 *:0
www  httpd  17244  215* internet stream tcp 0x0 *:0
www  httpd  17244  216* internet stream tcp 0x0 193.214.208.180:443 -- 
86.129.139.178:61345
$

FWIW, the following is a dump from some earlier connections from the same
client (they look too short):

07:23:48.292311 193.214.208.180.443  86.129.139.178.51968: S 
4293888040:4293888040(0) ack 147006770 win 16384 mss 
1460,nop,nop,sackOK,nop,wscale 3 (DF)
  : 4500 0034 2113 4000 4006 a4f2 c1d6 d0b4  E..4!.@.@...
  0010: 5681 8bb2 01bb cb00 ffef 8828 08c3 2532  V..(..%2
  0020: 8012 4000 377d  0204 05b4 0101 0402  ..@.7}..
  0030: 0103 0303

07:23:48.345674 86.129.139.178.51968  193.214.208.180.443: . ack 1 win 16698 
(DF)
  : 4500 0028 5a8a 4000 7206 3987 5681 8bb2  E..(Z.@.r.9.V...
  0010: c1d6 d0b4 cb00 01bb 08c3 2532 ffef 8829  ..%2...)
  0020: 5010 413a 7711  dd2d     P.A:w-

07:23:48.346721 86.129.139.178.51968  193.214.208.180.443: P 1:116(115) ack 1 
win 16698 (DF)
  : 4500 009b 5a8b 4000 7206 3913 5681 8bb2  E...Z.@.r.9.V...
  0010: c1d6 d0b4 cb00 01bb 08c3 2532 ffef 8829  ..%2...)
  0020: 5018 413a 0813  1603 0100 6e01   

Re: httpd stops accepting connections after a few hours on current

2015-07-13 Thread Alex Greif

Hi,

maybe this problem is related to this one?
http://marc.info/?l=openbsd-miscm=143091663725238w=2

thanks,
Alex



Re: httpd stops accepting connections after a few hours on current

2015-07-13 Thread Jack Burton
On Mon, 2015-07-13 at 11:02 +0200, Tor Houghton wrote: 
 On Sun, Jul 12, 2015 at 07:56:37PM +0930, Jack Burton wrote:
  
  It is possible I simply failed to provision sufficient capacity --
  which could easily be fixed by adding a login class for www with a
  higher limit on open fds -- but I fear that might just be hiding the
  problem rather than addressing it: exhausting a 512 fd limit with with
  peak load of only 48 req/sec (and average load of 2 req/sec) just
  doesn't feel right (especially when that peak load is all 303s
  generated internally by httpd, which each take only a tiny fraction of
  a second to process).
 
 I don't pretend to know httpd (at all), but I'm wondering, what should
 fstat(1) say, over time, for the httpd processes?

Thanks Tor -- that was exactly the clue I needed to isolate the
problem.

Wrote a short script to parse the output of running fstat -p for each
running httpd (we're running with prefork 8, so I didn't fancy doing it
by hand), and report the timestamp of the last request in the relevant
access log of each client IP with an open socket (or 'missing' if no
entry in the current access log).

Ran it roughly 4 hours after the last log rotation and found only 34
matches out of 73 open sockets. We don't run anything here that would
take anywhere near 4 hours to return a response, so the 39 that didn't
match entries in any of the current access logs were clearly where I
needed to look.

All 39 related to admin -- the one HTTPS server that I hadn't spent
any time looking into (since it accounts for only 0.02% of httpd's load
here, it didn't occur to me that that tiny little thing could be
bringing httpd to its knees ... famous last words).

admin talks to a custom FastCGI daemon, which is most likely the culprit
-- I'll debug it tomorrow.

portal (the other HTTPS server) also talks to a (different) custom
FastCGI daemon, but carries orders of magnitude more traffic and didn't
have any stale sockets -- so clearly our problem is at the other end of
admin's FastCGI socket (not with httpd itself). Sorry for the noise.

Ted -- similarly, you may want to look into whatever is at the other end
of your server1's FastCGI socket. If your issue is the same as ours,
that's likely where you'll find the cause.



Re: httpd stops accepting connections after a few hours on current

2015-07-13 Thread Edgar Pettijohn

In my very limited test I could only hang up a server on 443 not 80.



Re: httpd stops accepting connections after a few hours on current

2015-07-12 Thread Jack Burton
On Sat, 2015-07-11 at 15:38 +0930, Jack Burton wrote: 
 It hasn't happened here in a few days now so I don't have a log extract
 on hand to share (but can post one next time it happens).

Okay, the issue returned this afternoon and the httpd debug output
certainly sheds more light on the problem.

This time we didn't see either the TLS or buffer event errors anywhere
near the time at which httpd stopped responding to requests.

Instead, we're getting server_accept: deferring connections. According
to the comments in server.c, that means we're running out of file
descriptors.

That struck me as odd, as our traffic generally isn't anywhere near high
enough to expect that, so I checked the traffic at the time and there
was indeed a spike although it didn't seem high enough to cause issues.
Peak load was 48 requests in the one second before httpd stopped
responding to requests.

All 48 of those requests were to the trivial http server, whose config
is just:

  listen on $int_addr port 80
  block return 303 https://portal.tvir.acscomp.net;

(yes I know that that hostname doesn't resolve publicly -- but it does
when using the resolver assigned by dhcp on the semi-public [but not
Internet-facing] network on which our httpd listens)

As an aside, I didn't see in the debug output any requests during that
final second [although there were two a couple of seconds later] to the
target https server portal (which is served by the same instance of
httpd) -- but I guess it's possible that all 48 clients either didn't
act on the 303 or already had its target in their caches (environment
is a residential building for tertiary students, so the user base is
fairly static at this time of year -- so seems well within the realms
of possibility that all 48 had / on portal cached).

Debug output at the time httpd stopped responding reads (after 47 other
requests to the trivial http server all timestamped 16:08:54):

redir 192.168.137.160 - - [12/Jul/2015:16:08:54 +0930] GET /personal
HTTP/1.1 303 0
server redir, client 119933 (505 active), 192.168.137.160:40521 -
192.168.137.1, https://portal.tvir.acscomp.net (303 See Other)
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119935 (505 active), 192.168.137.160:45643 -
192.168.137.1, done
server redir, client 119934 (504 active), 192.168.137.160:40526 -
192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119936 (505 active), 192.168.137.160:47925 -
192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119938 (505 active), 192.168.137.160:40528 -
192.168.137.1, done
server redir, client 119937 (504 active), 192.168.137.160:40527 -
192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119940 (505 active), 192.168.137.160:37213 -
192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
portal.tvir.acscomp.net 192.168.137.99 - - [12/Jul/2015:16:08:56 +0930]
GET / HTTP/1.1 200 0
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
portal.tvir.acscomp.net 192.168.137.112 - - [12/Jul/2015:16:08:57 +0930]
GET / HTTP/1.1 200 0
server_accept: deferring connections

Then nothing but server_accept: deferring connections over and over
again.

It is possible I simply failed to provision sufficient capacity --
which could easily be fixed by adding a login class for www with a
higher limit on open fds -- but I fear that might just be hiding the
problem rather than addressing it: exhausting a 512 fd limit with with
peak load of only 48 req/sec (and average load of 2 req/sec) just
doesn't feel right (especially when that peak load is all 303s
generated internally by httpd, which each take only a tiny fraction of
a second to process).

I notice in the source that server_close_http() is responsible for
freeing session-specific fds, and that it's called from server_close(),
which is also responsible for generating the ..., done debug messages
and decrementing the active client count.

We're only seeing those ..., done messages in the debug output for a
small proportion of completed HTTP sessions, and the active client count
continues to grow (and only falls occasionally), even when there is much
less HTTP traffic.

Is seems as if some HTTP sessions get their fds freed on completion
while others don't ... but I can't find anything in the source to
support that conjecture.

Could someone who's more familiar with httpd than I am offer a clue
please?



Re: httpd stops accepting connections after a few hours on current

2015-07-11 Thread Jack Burton
On Thu, 2015-07-09 at 11:59 +0200, Tor Houghton wrote: 
 On Wed, Jul 08, 2015 at 10:04:27PM -0500, Theodore Wynnychenko wrote:
  
  [snip]
 
  server https://server2.tldn.com, client 2067 (63 active), 10.0.28.254:60330 
  -
  10.0.28.130:443, buffer event error
  [..]
  server https://server2.tldn.com, client 2068 (63 active), 10.0.28.254:52350 
  -
  10.0.28.130:443, buffer event error
 
 I'm going to me too on this one (have not been until now, as I thought
 perhaps it was due to my setup, and therefore off-topic).

Likewise, seeing the same behaviour here on 5.7-stable -- so the
problem is not confined to -current.

Fairly small  simple httpd setup here, httpd configured with 3 server
stanzas: 2 HTTPS-only (both using FastCGI) plus one trivial HTTP-only
(just a block return 303 pointing to one of the HTTPS servers). Quite a
light load too (averaging 178k requests/day -- about 2/sec).

Frequency of problem varies wildly -- sometimes occurs after only an
hour or two since last httpd restart and at other times httpd will last
for up to 4 days before it stops responding to requests. Variation in
volume of requests appears to have no effect on frequency of recurrence
either.

On every occasion, httpd continues to respond correctly to signals
(httpd restarts are always clean), just not to HTTP[S] requests.

On at least one occasion, the http socket continued to respond correctly
to requests, whilst the two https ones stopped responding. On other
occasions, all 3 stopped responding at around the same time.

When a socket stops responding, it still accepts requests but httpd
neither logs (at least, when not in debug mode) nor responds to them
(i.e. I can successfully open a TCP session to the listening socket and
send it a request, but nothing comes back after the initial ACK).

It hasn't happened here in a few days now so I don't have a log extract
on hand to share (but can post one next time it happens).

From memory in the past we were seeing TLS accept fail errors in the
logs, as reported by the original poster, but not at the time the
sockets stopped responding (only well beforehand), so I'd also assumed
that those were unrelated. Running tcpdump on both user-facing
interfaces (and on pflog0 just to rule out the possibility of some
error in our pf.conf) whilst httpd was not responding to requests on
previous occasions revealed nothing new.

Have tried watching debug output a couple of times before, but it
rapidly gets quite unwieldy, even with our modest load (especially over
a remote ssh session -- both uplinks at that site are nearing
capacity), given the length of time it can take for the problem to
manifest (on each occasion I gave up after a few hours without the
problem occurring).

Am now running httpd -dvvv with stdout/err redirected to a temporary log
file (probably should have done that in the first place).

We are already seeing (after less than a minute) entries in the debug
logs similar to those reported by Theodore, for example:

* On an HTTPS server (using FastCGI):
server portal, client 305 (14 active), 192.168.137.161:52224 -
192.168.137.1:443, buffer event error

and

* On the trivial HTTP server (using just a block return 303):
server redir, client 132 (11 active), 192.168.137.100:61081 -
192.168.137.1, buffer event timeout

However, the original problem (httpd stops responding to requests) is
*not* occurring at present.

Will post debug log extract  httpd.conf next time the problem recurs
(should be within the next few days).



Re: httpd stops accepting connections after a few hours on current

2015-07-10 Thread Theodore Wynnychenko
-Original Message-
From: owner-m...@openbsd.org [mailto:owner-m...@openbsd.org] On Behalf Of Edgar
Pettijohn
Sent: Friday, July 10, 2015 7:32 PM
To: misc@openbsd.org
Subject: Re: httpd stops accepting connections after a few hours on current

On 07/08/15 22:04, Theodore Wynnychenko wrote:
 From: owner-m...@openbsd.org [mailto:owner-m...@openbsd.org] On Behalf Of
Edgar
 Pettijohn
 Sent: Wednesday, July 08, 2015 8:52 PM
 To: misc@openbsd.org
 Subject: Re: httpd stops accepting connections after a few hours on current

 On 07/08/15 12:15, Henrik Friedrichsen wrote:
 I have encountered the same problem as has the guy in [1]. Maybe it is
 worth a bug report?

 Could this be related to FastCGI and exhaustion of some ressource?

 [1]: https://www.facebook.com/groups/2210554563/permalink/10153383131319564/

 Have you run in debug mode?

 # httpd -dvvv

 --

 Hello
 Earlier today, I started it in debug mode.
 What I noticed is that for the first few hours there is nothing much of
 interest.
 A bunch of:

 http://server3.tldn.com 10.0.28.254 - - [08/Jul/2015:14:38:31 -0500] POST
 /rpc/ClientApi HTTP/1.1 301 0
 server http://server3.tldn.com, client 1552 (36 active), 10.0.28.254:65357 -
 10.0.28.131, https://server3.tldn.com/index.html (301 Moved Permanently)

 But, then I noticed (and I actually did notice this in the logs the other day,
 but was too stupid to understand that it was - I think now - important):

 server https://server2.tldn.com, client 2067 (63 active), 10.0.28.254:60330 -
 10.0.28.130:443, buffer event error
 server https://server2.tldn.com, client 1756 (72 active), 10.0.28.254:56005 -
 10.0.28.130:443, buffer event error
 server https://server2.tldn.com, client 1521 (58 active), 10.0.28.254:56066 -
 10.0.28.130:443, buffer event error
 server https://server2.tldn.com, client 2068 (63 active), 10.0.28.254:52350 -
 10.0.28.130:443, buffer event error

The httpd.conf you supplied doesn't show this server serving fastcgi is 
that true?  I'm testing a similiar config to see if I can duplicate the 
error.

 It appears that the buffer event errors aren't there for a few hours, then
 start, then increase, and eventually there is no more logging, and httpd is
not
 responsive.


Correct.  Only server1 utilizes fastcgi.
server2 and server3 only have static content.



Re: httpd stops accepting connections after a few hours on current

2015-07-10 Thread Edgar Pettijohn

On 07/08/15 22:04, Theodore Wynnychenko wrote:

From: owner-m...@openbsd.org [mailto:owner-m...@openbsd.org] On Behalf Of Edgar
Pettijohn
Sent: Wednesday, July 08, 2015 8:52 PM
To: misc@openbsd.org
Subject: Re: httpd stops accepting connections after a few hours on current

On 07/08/15 12:15, Henrik Friedrichsen wrote:

I have encountered the same problem as has the guy in [1]. Maybe it is
worth a bug report?

Could this be related to FastCGI and exhaustion of some ressource?

[1]: https://www.facebook.com/groups/2210554563/permalink/10153383131319564/


Have you run in debug mode?

# httpd -dvvv

--

Hello
Earlier today, I started it in debug mode.
What I noticed is that for the first few hours there is nothing much of
interest.
A bunch of:

http://server3.tldn.com 10.0.28.254 - - [08/Jul/2015:14:38:31 -0500] POST
/rpc/ClientApi HTTP/1.1 301 0
server http://server3.tldn.com, client 1552 (36 active), 10.0.28.254:65357 -
10.0.28.131, https://server3.tldn.com/index.html (301 Moved Permanently)

But, then I noticed (and I actually did notice this in the logs the other day,
but was too stupid to understand that it was - I think now - important):

server https://server2.tldn.com, client 2067 (63 active), 10.0.28.254:60330 -
10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 1756 (72 active), 10.0.28.254:56005 -
10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 1521 (58 active), 10.0.28.254:56066 -
10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 2068 (63 active), 10.0.28.254:52350 -
10.0.28.130:443, buffer event error


The httpd.conf you supplied doesn't show this server serving fastcgi is 
that true?  I'm testing a similiar config to see if I can duplicate the 
error.


It appears that the buffer event errors aren't there for a few hours, then
start, then increase, and eventually there is no more logging, and httpd is not
responsive.

[demime 1.01d removed an attachment of type application/x-pkcs7-signature which 
had a name of smime.p7s]




Re: httpd stops accepting connections after a few hours on current

2015-07-10 Thread Edgar Pettijohn

On 07/10/15 19:32, Edgar Pettijohn wrote:

On 07/08/15 22:04, Theodore Wynnychenko wrote:
From: owner-m...@openbsd.org [mailto:owner-m...@openbsd.org] On 
Behalf Of Edgar

Pettijohn
Sent: Wednesday, July 08, 2015 8:52 PM
To: misc@openbsd.org
Subject: Re: httpd stops accepting connections after a few hours on 
current


On 07/08/15 12:15, Henrik Friedrichsen wrote:

I have encountered the same problem as has the guy in [1]. Maybe it is
worth a bug report?

Could this be related to FastCGI and exhaustion of some ressource?

[1]: 
https://www.facebook.com/groups/2210554563/permalink/10153383131319564/



Have you run in debug mode?

# httpd -dvvv

--

Hello
Earlier today, I started it in debug mode.
What I noticed is that for the first few hours there is nothing much of
interest.
A bunch of:

http://server3.tldn.com 10.0.28.254 - - [08/Jul/2015:14:38:31 -0500] 
POST

/rpc/ClientApi HTTP/1.1 301 0
server http://server3.tldn.com, client 1552 (36 active), 
10.0.28.254:65357 -

10.0.28.131, https://server3.tldn.com/index.html (301 Moved Permanently)

But, then I noticed (and I actually did notice this in the logs the 
other day,

but was too stupid to understand that it was - I think now - important):

server https://server2.tldn.com, client 2067 (63 active), 
10.0.28.254:60330 -

10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 1756 (72 active), 
10.0.28.254:56005 -

10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 1521 (58 active), 
10.0.28.254:56066 -

10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 2068 (63 active), 
10.0.28.254:52350 -

10.0.28.130:443, buffer event error


The httpd.conf you supplied doesn't show this server serving fastcgi 
is that true?  I'm testing a similiar config to see if I can duplicate 
the error.


It appears that the buffer event errors aren't there for a few 
hours, then
start, then increase, and eventually there is no more logging, and 
httpd is not

responsive.

[demime 1.01d removed an attachment of type 
application/x-pkcs7-signature which had a name of smime.p7s]


This may not have been the most scientific test, but I was able to 
reproduce as follows:


httpd.conf
server default {
listen on $ext_addr port 80
directory auto index
}

server secure {
listen on 192.168.1.5 tls port 443
tls certificate /etc/ssl/web.crt
tls key /etc/ssl/private/web.key
directory auto index
}

test.sh
#!/bin/sh

while true
do
wget --no-check-certificate https://192.168.1.5/sasl/index.html 
 rm index.html

done

had to add --no-check-certificate due to:

ERROR: cannot verify 192.168.1.5's certificate, issued by 
'CN=192.168.1.5,L=Austin,ST=Texas,C=US':

  Self-signed certificate encountered.

I tried a couple of certificates all self signed with the same results.

httpd -dvvv
secure 192.168.1.5 - - [10/Jul/2015:20:52:13 -0500] GET 
/sasl/index.html HTTP/1.1 200 5491
server secure, client 3101 (1 active), 192.168.1.5:21322 - 
192.168.1.5:443, buffer event error
secure 192.168.1.5 - - [10/Jul/2015:20:52:13 -0500] GET 
/sasl/index.html HTTP/1.1 200 5491
server secure, client 3002 (3 active), 192.168.1.5:23332 - 
192.168.1.5:443, buffer event error
secure 192.168.1.5 - - [10/Jul/2015:20:52:13 -0500] GET 
/sasl/index.html HTTP/1.1 200 5491
server secure, client 3102 (1 active), 192.168.1.5:23204 - 
192.168.1.5:443, buffer event error

(after killing test.sh)
server_accept_tls: TLS accept failed - accept failed: Connection reset 
by peer




Re: httpd stops accepting connections after a few hours on current

2015-07-09 Thread Tor Houghton
On Wed, Jul 08, 2015 at 10:04:27PM -0500, Theodore Wynnychenko wrote:
 
 [snip]

 server https://server2.tldn.com, client 2067 (63 active), 10.0.28.254:60330 -
 10.0.28.130:443, buffer event error
 [..]
 server https://server2.tldn.com, client 2068 (63 active), 10.0.28.254:52350 -
 10.0.28.130:443, buffer event error
 

I'm going to me too on this one (have not been until now, as I thought
perhaps it was due to my setup, and therefore off-topic).

I run several OpenBSD guests on VirtualBox, and have been for years without
any problem (for my use). Since moving to httpd (from apache), I've also 
noticed the buffer event errors. In my case, it's not just httpd that hangs,
but the guest itself, with its corresponding VirtualBox process on the host
pegging the cpu.

I've also seen

httpd[7197]: server_accept_tls: TLS accept failed - (null)

But this appears to be related to clients not completing / attempting proper
TLS connections (tcpdumps verify this). 

I shall endeavour to effort into obtaining som guest debugging information
if anyone is interested.

Tor



Re: httpd stops accepting connections after a few hours on current

2015-07-09 Thread Matthew Martin
On 7/9/15, Tor Houghton t...@bogus.net wrote:
 On Wed, Jul 08, 2015 at 10:04:27PM -0500, Theodore Wynnychenko wrote:

 [snip]

 server https://server2.tldn.com, client 2067 (63 active),
 10.0.28.254:60330 -
 10.0.28.130:443, buffer event error
 [..]
 server https://server2.tldn.com, client 2068 (63 active),
 10.0.28.254:52350 -
 10.0.28.130:443, buffer event error


 I'm going to me too on this one (have not been until now, as I thought
 perhaps it was due to my setup, and therefore off-topic).

 I run several OpenBSD guests on VirtualBox, and have been for years without
 any problem (for my use). Since moving to httpd (from apache), I've also
 noticed the buffer event errors. In my case, it's not just httpd that
 hangs,
 but the guest itself, with its corresponding VirtualBox process on the host
 pegging the cpu.

 I've also seen

 httpd[7197]: server_accept_tls: TLS accept failed - (null)

 But this appears to be related to clients not completing / attempting
 proper
 TLS connections (tcpdumps verify this).

 I shall endeavour to effort into obtaining som guest debugging information
 if anyone is interested.

 Tor



Possibly related? http://marc.info/?l=openbsd-bugsm=143370643301193w=4
After the patch in that thread to fix large file transfers I was
seeing TLS issues pretty quickly. Wound up hopping back to Apache
until I could take another look on current which just hasn't happened
yet. 'buffer event error' and 'TLS accept failed' are definitely
ringing some bells here.



Re: httpd stops accepting connections after a few hours on current

2015-07-08 Thread Ted Wynnychenko
Hello again:

I was hoping someone might have an idea of how I could proceed with the httpd
failure that I am seeing.

As I said, I see nothing in the logs, and the only indication I have that
something is wrong is that httpd stops responding to requests.

At this point, I have added a cronjob of rcctl restart httpd to run every
couple of hours; but, it seems to me that my solution is not something that is
consistent with the quality/stability of OpenBSD.

If anyone has any suggestions on what I can do, or what information might be
helpful in identifying the issue, I would appreciate it.

Thanks

-Original Message-

Hello

On current:
OpenBSD 5.8-beta (GENERIC.MP) #1125: Fri Jul  3 20:54:45 MDT 2015
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP

System using 2 aliases on one interface:
ifconfig em0
em0: flags=8843UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST mtu 1500
lladdr 00:25:90:de:f8:67
priority: 0
groups: egress
media: Ethernet autoselect (1000baseT full-duplex,rxpause,txpause)
status: active
inet 10.0.28.129 netmask 0xff00 broadcast 10.0.28.255
inet 10.0.28.130 netmask 0x
inet 10.0.28.131 netmask 0x


HTTPD listening on all three IP's with different content:
cat /etc/httpd.conf

http_ip=10.0.28.129

types {
include /usr/share/misc/mime.types
}

server server1.tldn.com {
listen on $http_ip port 80
connection max requests 100
root /htdocs
directory index index.html
block return 301 https://server1.tldn.com$REQUEST_URI;
}

server server1.tldn.com {
listen on $http_ip tls port 443
tls certificate /etc/ssl/server-129.crt
tls key /etc/ssl/private/server-129.key
connection max requests 100
root /htdocs
directory index index.html
location /cgi-bin/* {
fastcgi
root /
}
}

server server2.tldn.com {
listen on 10.0.28.130 port 80
connection max requests 100
root /htdocs-130
directory index index.html
block return 301 https://server2.tldn.com/default.html;
}

server server2.tldn.com {
listen on 10.0.28.130 tls port 443
tls certificate /etc/ssl/server-130.crt
tls key /etc/ssl/private/server-130.key
connection max requests 100
root /htdocs-130
directory index index.html
}

server server3.tldn.com {
listen on 10.0.28.131 port 80
connection max requests 100
root /htdocs-131
directory index index.html
block return 301 https://server3.tldn.com/index.html;
}

server server3.tldn.com {
listen on 10.0.28.131 tls port 443
tls certificate /etc/ssl/server-131.crt
tls key /etc/ssl/private/server-131.key
connection max requests 100
root /htdocs-131
directory index index.html
}


The httpd daemon starts up without an issue and runs without a problem for a few
hours (I have not actually timed it, but it goes for about 6-8 hours), and then
just stops responding.
This is true on all three of the ip's that it is listening on; and whether or
not the original connection is to port 80 or 443.

The access log show all connections until (I presume) the daemon stops answering
connections.

I have tried combining all the ssl certs into one (with a lot of alternative
names), but that did not make a difference.

A simple rcctl restart httpd brings the server back up for a few hours.

I would be happy to investigate this more, but I don't know where to go.
Similarly, I don't know what other information I could provide to help diagnose
the problem.

Any help would be great.

Thanks - ted

[demime 1.01d removed an attachment of type application/x-pkcs7-signature which 
had a name of smime.p7s]



Re: httpd stops accepting connections after a few hours on current

2015-07-08 Thread Henrik Friedrichsen
I have encountered the same problem as has the guy in [1]. Maybe it is
worth a bug report?

Could this be related to FastCGI and exhaustion of some ressource?

[1]: https://www.facebook.com/groups/2210554563/permalink/10153383131319564/



Re: httpd stops accepting connections after a few hours on current

2015-07-08 Thread Theodore Wynnychenko
From: owner-m...@openbsd.org [mailto:owner-m...@openbsd.org] On Behalf Of Edgar
Pettijohn
Sent: Wednesday, July 08, 2015 8:52 PM
To: misc@openbsd.org
Subject: Re: httpd stops accepting connections after a few hours on current

On 07/08/15 12:15, Henrik Friedrichsen wrote:
 I have encountered the same problem as has the guy in [1]. Maybe it is
 worth a bug report?

 Could this be related to FastCGI and exhaustion of some ressource?

 [1]: https://www.facebook.com/groups/2210554563/permalink/10153383131319564/

Have you run in debug mode?

# httpd -dvvv

--

Hello
Earlier today, I started it in debug mode.
What I noticed is that for the first few hours there is nothing much of
interest.
A bunch of:

http://server3.tldn.com 10.0.28.254 - - [08/Jul/2015:14:38:31 -0500] POST
/rpc/ClientApi HTTP/1.1 301 0
server http://server3.tldn.com, client 1552 (36 active), 10.0.28.254:65357 -
10.0.28.131, https://server3.tldn.com/index.html (301 Moved Permanently)

But, then I noticed (and I actually did notice this in the logs the other day,
but was too stupid to understand that it was - I think now - important):

server https://server2.tldn.com, client 2067 (63 active), 10.0.28.254:60330 -
10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 1756 (72 active), 10.0.28.254:56005 -
10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 1521 (58 active), 10.0.28.254:56066 -
10.0.28.130:443, buffer event error
server https://server2.tldn.com, client 2068 (63 active), 10.0.28.254:52350 -
10.0.28.130:443, buffer event error

It appears that the buffer event errors aren't there for a few hours, then
start, then increase, and eventually there is no more logging, and httpd is not
responsive.

[demime 1.01d removed an attachment of type application/x-pkcs7-signature which 
had a name of smime.p7s]



Re: httpd stops accepting connections after a few hours on current

2015-07-08 Thread Edgar Pettijohn

On 07/08/15 12:15, Henrik Friedrichsen wrote:

I have encountered the same problem as has the guy in [1]. Maybe it is
worth a bug report?

Could this be related to FastCGI and exhaustion of some ressource?

[1]: https://www.facebook.com/groups/2210554563/permalink/10153383131319564/


Have you run in debug mode?

# httpd -dvvv



httpd stops accepting connections after a few hours on current

2015-07-05 Thread Theodore Wynnychenko
Hello

On current:
OpenBSD 5.8-beta (GENERIC.MP) #1125: Fri Jul  3 20:54:45 MDT 2015
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP

System using 2 aliases on one interface:
ifconfig em0
em0: flags=8843UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST mtu 1500
lladdr 00:25:90:de:f8:67
priority: 0
groups: egress
media: Ethernet autoselect (1000baseT full-duplex,rxpause,txpause)
status: active
inet 10.0.28.129 netmask 0xff00 broadcast 10.0.28.255
inet 10.0.28.130 netmask 0x
inet 10.0.28.131 netmask 0x


HTTPD listening on all three IP's with different content:
cat /etc/httpd.conf

http_ip=10.0.28.129

types {
include /usr/share/misc/mime.types
}

server server1.tldn.com {
listen on $http_ip port 80
connection max requests 100
root /htdocs
directory index index.html
block return 301 https://server1.tldn.com$REQUEST_URI;
}

server server1.tldn.com {
listen on $http_ip tls port 443
tls certificate /etc/ssl/server-129.crt
tls key /etc/ssl/private/server-129.key
connection max requests 100
root /htdocs
directory index index.html
location /cgi-bin/* {
fastcgi
root /
}
}

server server2.tldn.com {
listen on 10.0.28.130 port 80
connection max requests 100
root /htdocs-130
directory index index.html
block return 301 https://server2.tldn.com/default.html;
}

server server2.tldn.com {
listen on 10.0.28.130 tls port 443
tls certificate /etc/ssl/server-130.crt
tls key /etc/ssl/private/server-130.key
connection max requests 100
root /htdocs-130
directory index index.html
}

server server3.tldn.com {
listen on 10.0.28.131 port 80
connection max requests 100
root /htdocs-131
directory index index.html
block return 301 https://server3.tldn.com/index.html;
}

server server3.tldn.com {
listen on 10.0.28.131 tls port 443
tls certificate /etc/ssl/server-131.crt
tls key /etc/ssl/private/server-131.key
connection max requests 100
root /htdocs-131
directory index index.html
}


The httpd daemon starts up without an issue and runs without a problem for a few
hours (I have not actually timed it, but it goes for about 6-8 hours), and then
just stops responding.
This is true on all three of the ip's that it is listening on; and whether or
not the original connection is to port 80 or 443.

The access log show all connections until (I presume) the daemon stops answering
connections.

I have tried combining all the ssl certs into one (with a lot of alternative
names), but that did not make a difference.

A simple rcctl restart httpd brings the server back up for a few hours.

I would be happy to investigate this more, but I don't know where to go.
Similarly, I don't know what other information I could provide to help diagnose
the problem.

Any help would be great.

Thanks - ted