Re: Viewport for man.openbsd.org -- readability on phones

2018-05-17 Thread Jack Burton
On Thu, 17 May 2018 18:32:44 -0400
Aner Perez wrote:
> First non-comment line of mandoc.css says:
> 
> html {max-width: 100ex; }
> 
> Removing this line allows the use of the full browser width.  I'm
> sure that it was put there for a reason (maybe to approximate the
> width of a terminal?).

Some browsers simply don't calculate lengths expressed in exes correctly
-- seen that in many other contexts. Last time I checked (about 3 years
ago, so it might well have changed since), two of the four most common
browsers still exhibited that fault.

As a quick experiment, try looking up the metrics of the font your
browser actually uses to render man pages, then convert 100ex into ems
for your font and put the result in the max-width property in your
local copy of mandoc.css.
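
A minimal sketch of that conversion in Python. The x-height ratio
(x-height divided by font size) is an assumption: it has to come from
the metrics of whatever font your browser really uses. The 0.5 used in
the example is only a placeholder (it is also the value CSS tells
browsers to fall back on when a font's x-height cannot be determined).
The function name is made up for illustration.

```python
def ex_to_em(length_ex: float, x_height_ratio: float) -> float:
    """Convert a CSS length in ex to the equivalent length in em.

    x_height_ratio is the font's x-height divided by its em size
    (hypothetical input: look it up in the font's metrics).
    """
    if x_height_ratio <= 0:
        raise ValueError("x_height_ratio must be positive")
    # 1ex is one x-height, and one x-height is x_height_ratio ems.
    return length_ex * x_height_ratio


def max_width_rule(length_ex: float, x_height_ratio: float) -> str:
    """Build a replacement rule for the local stylesheet copy."""
    return "html {max-width: %gem; }" % ex_to_em(length_ex, x_height_ratio)
```

With the placeholder ratio of 0.5, max_width_rule(100, 0.5) gives
"html {max-width: 50em; }". A font with a taller x-height gives a
larger em value.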

If that fixes your width issue then you'll have clear evidence that the
bug lies in the browser (specifically in its routine for converting
exes to whatever its native display length unit is).



Re: httpd stops accepting connections after a few hours on current

2015-07-15 Thread Jack Burton
On Wed, 2015-07-15 at 21:41 +0930, Jack Burton wrote: 
 The fix is trivial -- see attached patch (against 5.7-stable -- sorry,
 I don't have any hosts running -current at present).
... 
 [demime 1.01d removed an attachment of type text/x-patch which had a
 name of httpd_server_accept_tls.patch; charset=UTF-8]

Sorry, didn't realise I couldn't post a patch to the misc@ (I've never
needed to before).

Please excuse my ignorance, but what is the accepted way to contribute a
patch?



Re: httpd stops accepting connections after a few hours on current

2015-07-15 Thread Jack Burton
On Mon, 2015-07-13 at 16:19 +0200, Tor Houghton wrote: 
 On Mon, Jul 13, 2015 at 10:52:46PM +0930, Jack Burton wrote:
   
   I don't pretend to know httpd (at all), but I'm wondering, what should
   fstat(1) say, over time, for the httpd processes?
  
  Thanks Tor -- that was exactly the clue I needed to isolate the
  problem.
  
  [snip]
 
  admin talks to a custom FastCGI daemon, which is most likely the culprit
  -- I'll debug it tomorrow.
... 
 
 I am not sure you should conclude yet. I don't use FastCGI. ;-}
 
 Now, as I write, I have 218 open fd's, compared to the 206 or whatever I had
 in my previous post. I've got a few dangling :443 streams (the :80 ones
 seem to disappear like they should), and then a bunch of these:

You're absolutely right -- I spoke too soon.

After double-checking that every possible path a request could take
through the custom FastCGI daemon used by admin ends by sending an
FCGI_END_REQUEST record back to httpd (it does), I turned my attention
back to the httpd logs & debug messages gathered.

This time I had my little script check the remote IP addresses of those
sockets against all the httpd access logs (not just the current ones) and,
where nothing matched there, finally check the httpd debug output too.

Again, only the admin server (the only one here that's Internet-facing)
had stale sockets (all open sockets for redir & portal matched log
entries) -- out of 26 open sockets, 4 matched log entries for current
HTTPS sessions, 2 matched buffer event error debug messages and the
other 20 didn't match in either the logs or debug messages.

I still don't know what's causing the buffer event error messages, but
as they accounted for only 2 of the 22 stale sockets, I figured it was
more important to focus on the other 20 first.

So, what sort of HTTPS event doesn't make it into the logs and doesn't
cause any debug messages containing the remote IP address to be emitted
either?

The only thing I could think of was a TCP connection to port 443 where
the remote end doesn't initiate a TLS handshake (that's nowhere near as
improbable as it sounds: think a simple port scan, or a network outage
commencing directly after the first ACK).

So, as a test I tried just that: establishing a TCP session from a
remote host then closing it without sending anything at all at layer 5.

Naturally, doing that where httpd expects plain HTTP causes only a
single debug message to be emitted (...done), and the socket gets
closed as expected.

But doing it where httpd expects HTTPS, the local side of the socket
remains open, nothing appears in the regular logs, and nothing
identifiable by remote IP address appears in debug messages either.
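
The probe described above can be sketched in Python as follows. This
is an illustration under assumptions, not the tool used in the
experiment: the function name and its parameters are hypothetical.

```python
import socket
import time


def connect_and_close(host: str, port: int, hold_seconds: float = 0.0,
                      timeout: float = 5.0) -> None:
    """Complete a TCP handshake, send no application data, then close."""
    # create_connection() returns once the three-way handshake is done.
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        # Deliberately send nothing: no HTTP request, no TLS ClientHello.
        time.sleep(hold_seconds)
    finally:
        # Close our side; the server sees end-of-file on its socket.
        sock.close()
```

Calling connect_and_close("192.0.2.10", 443) against a test server
(the address is a documentation placeholder) and then checking fstat
on that server shows whether its side of the socket was released.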

Trying to match log/debug entries that aren't identified by the remote
IP address on a host with even a modest amount of traffic struck me as
an exercise in futility, so I tried the same experiment on another host
(also running 5.7-stable) with no other load on httpd at all.

Result was the same: httpd did not close the socket or log anything in
the regular logs. However, one debug message was emitted, our old friend
server_accept_tls: TLS accept failed - (null)...

...which brings us right back to where this thread started.

Looking at the source, server_accept_tls() handles two types of
non-recoverable error condition: timeout after retry and outright
failure. In the first case (EV_TIMEOUT), server_accept_tls() calls
server_close() (which in turn calls server_close_http(), which closes
the socket) before returning; in the second case it does not.
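
To make the consequence concrete, here is a toy model in Python. It is
not httpd code and it does not reproduce the removed patch. The class,
the outcome labels and the close_on_failure switch are all invented
for illustration, assuming only what is described above: one error
path closes the connection and the other does not.

```python
class ToyListener:
    """Toy model of a listener that counts open descriptors."""

    def __init__(self, close_on_failure: bool):
        # Assumption: a fix would make the failure path close as well.
        self.close_on_failure = close_on_failure
        self.open_fds = 0

    def accept(self) -> None:
        self.open_fds += 1  # each accepted connection holds one descriptor

    def close(self) -> None:
        self.open_fds -= 1  # stands in for the close path

    def handshake_result(self, outcome: str) -> None:
        if outcome == "timeout":
            self.close()  # described in the post: this path closes
        elif outcome == "failure":
            if self.close_on_failure:
                self.close()  # hypothetical fixed behaviour
        else:
            raise ValueError("unknown outcome: %r" % outcome)


def leaked_after(outcomes, close_on_failure: bool) -> int:
    """Return how many descriptors remain open after all outcomes."""
    listener = ToyListener(close_on_failure)
    for outcome in outcomes:
        listener.accept()
        listener.handshake_result(outcome)
    return listener.open_fds
```

In this model every "failure" outcome leaves one descriptor open
unless close_on_failure is set, so a steady trickle of failed
handshakes is enough to exhaust a fixed descriptor limit over time.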

I believe this is the bug we've been looking for.

The fix is trivial -- see attached patch (against 5.7-stable -- sorry,
I don't have any hosts running -current at present).

That works for me (tested here on two hosts: sparc64 with test load
only; and amd64 with modest production load).

Not sure if that's the best approach or not, but now that we've at
least established root cause, if there's a better way I'm sure someone
else on the list will point it out.

[demime 1.01d removed an attachment of type text/x-patch which had a name of 
httpd_server_accept_tls.patch; charset=UTF-8]



Re: httpd stops accepting connections after a few hours on current

2015-07-15 Thread Jack Burton
On Wed, 2015-07-15 at 12:56 +, Mike Burns wrote: 
 On 2015-07-15 21.49.11 +0930, Jack Burton wrote:
  Sorry, didn't realise I couldn't post a patch to the misc@ (I've never
  needed to before).
  
  Please excuse my ignorance, but what is the accepted way to contribute a
  patch?
 
 Post it to tech@ .

Done. See post to tech@ titled httpd: patch to close TLS sockets that
fail before TLS handshake.



Re: httpd stops accepting connections after a few hours on current

2015-07-13 Thread Jack Burton
On Mon, 2015-07-13 at 11:02 +0200, Tor Houghton wrote: 
 On Sun, Jul 12, 2015 at 07:56:37PM +0930, Jack Burton wrote:
  
  It is possible I simply failed to provision sufficient capacity --
  which could easily be fixed by adding a login class for www with a
  higher limit on open fds -- but I fear that might just be hiding the
  problem rather than addressing it: exhausting a 512 fd limit with
  peak load of only 48 req/sec (and average load of 2 req/sec) just
  doesn't feel right (especially when that peak load is all 303s
  generated internally by httpd, which each take only a tiny fraction of
  a second to process).
 
 I don't pretend to know httpd (at all), but I'm wondering, what should
 fstat(1) say, over time, for the httpd processes?

Thanks Tor -- that was exactly the clue I needed to isolate the
problem.

Wrote a short script to parse the output of running fstat -p for each
running httpd (we're running with prefork 8, so I didn't fancy doing it
by hand), and report the timestamp of the last request in the relevant
access log of each client IP with an open socket (or 'missing' if no
entry in the current access log).
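
A rough Python sketch of the same idea follows; it is not the script
used here. Two things are assumptions: that fstat prints the peer of
an accepted socket after a "<--" marker, and that access log lines
look like the ones quoted elsewhere in this thread (server name,
client IP, two dashes, bracketed timestamp). All function names are
hypothetical.

```python
import re

# Assumed fstat layout: "... local_ip:port <-- remote_ip:port"
PEER_RE = re.compile(r"<--\s+(\d{1,3}(?:\.\d{1,3}){3}):\d+")
# Access log layout as quoted in this thread:
# "redir 192.168.137.160 - - [12/Jul/2015:16:08:54 +0930] GET ..."
LOG_RE = re.compile(r"^\S+ (\d{1,3}(?:\.\d{1,3}){3}) \S+ \S+ \[([^\]]+)\]")


def peers_from_fstat(fstat_lines):
    """Return the remote IP of every open socket found in fstat output."""
    peers = []
    for line in fstat_lines:
        match = PEER_RE.search(line)
        if match:
            peers.append(match.group(1))
    return peers


def last_seen(log_lines):
    """Map each client IP to the timestamp of its last log entry."""
    seen = {}
    for line in log_lines:
        match = LOG_RE.match(line)
        if match:
            # Logs are chronological, so a later line overwrites an earlier one.
            seen[match.group(1)] = match.group(2)
    return seen


def report(fstat_lines, log_lines):
    """Pair each open socket's peer with its last log time or 'missing'."""
    seen = last_seen(log_lines)
    return [(ip, seen.get(ip, "missing")) for ip in peers_from_fstat(fstat_lines)]
```

Feeding it the fstat output for every httpd process and the lines of
the relevant access log gives one (IP, timestamp) pair per open
socket, with 'missing' marking the sockets worth a closer look.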

Ran it roughly 4 hours after the last log rotation and found only 34
matches out of 73 open sockets. We don't run anything here that would
take anywhere near 4 hours to return a response, so the 39 that didn't
match entries in any of the current access logs were clearly where I
needed to look.

All 39 related to admin -- the one HTTPS server that I hadn't spent
any time looking into (since it accounts for only 0.02% of httpd's load
here, it didn't occur to me that that tiny little thing could be
bringing httpd to its knees ... famous last words).

admin talks to a custom FastCGI daemon, which is most likely the culprit
-- I'll debug it tomorrow.

portal (the other HTTPS server) also talks to a (different) custom
FastCGI daemon, but carries orders of magnitude more traffic and didn't
have any stale sockets -- so clearly our problem is at the other end of
admin's FastCGI socket (not with httpd itself). Sorry for the noise.

Ted -- similarly, you may want to look into whatever is at the other end
of your server1's FastCGI socket. If your issue is the same as ours,
that's likely where you'll find the cause.



Re: httpd stops accepting connections after a few hours on current

2015-07-12 Thread Jack Burton
On Sat, 2015-07-11 at 15:38 +0930, Jack Burton wrote: 
 It hasn't happened here in a few days now so I don't have a log extract
 on hand to share (but can post one next time it happens).

Okay, the issue returned this afternoon and the httpd debug output
certainly sheds more light on the problem.

This time we didn't see either the TLS or buffer event errors anywhere
near the time at which httpd stopped responding to requests.

Instead, we're getting "server_accept: deferring connections". According
to the comments in server.c, that means we're running out of file
descriptors.

That struck me as odd, as our traffic generally isn't anywhere near high
enough to expect that, so I checked the traffic at the time and there
was indeed a spike although it didn't seem high enough to cause issues.
Peak load was 48 requests in the one second before httpd stopped
responding to requests.

All 48 of those requests were to the trivial http server, whose config
is just:

  listen on $int_addr port 80
  block return 303 https://portal.tvir.acscomp.net;

(yes I know that that hostname doesn't resolve publicly -- but it does
when using the resolver assigned by dhcp on the semi-public [but not
Internet-facing] network on which our httpd listens)

As an aside, I didn't see in the debug output any requests during that
final second [although there were two a couple of seconds later] to the
target https server portal (which is served by the same instance of
httpd) -- but I guess it's possible that all 48 clients either didn't
act on the 303 or already had its target in their caches (environment
is a residential building for tertiary students, so the user base is
fairly static at this time of year -- so seems well within the realms
of possibility that all 48 had / on portal cached).

Debug output at the time httpd stopped responding reads (after 47 other
requests to the trivial http server all timestamped 16:08:54):

redir 192.168.137.160 - - [12/Jul/2015:16:08:54 +0930] GET /personal
HTTP/1.1 303 0
server redir, client 119933 (505 active), 192.168.137.160:40521 -
192.168.137.1, https://portal.tvir.acscomp.net (303 See Other)
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119935 (505 active), 192.168.137.160:45643 -
192.168.137.1, done
server redir, client 119934 (504 active), 192.168.137.160:40526 -
192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119936 (505 active), 192.168.137.160:47925 -
192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119938 (505 active), 192.168.137.160:40528 -
192.168.137.1, done
server redir, client 119937 (504 active), 192.168.137.160:40527 -
192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119940 (505 active), 192.168.137.160:37213 -
192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
portal.tvir.acscomp.net 192.168.137.99 - - [12/Jul/2015:16:08:56 +0930]
GET / HTTP/1.1 200 0
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
portal.tvir.acscomp.net 192.168.137.112 - - [12/Jul/2015:16:08:57 +0930]
GET / HTTP/1.1 200 0
server_accept: deferring connections

Then nothing but "server_accept: deferring connections" over and over
again.

It is possible I simply failed to provision sufficient capacity --
which could easily be fixed by adding a login class for www with a
higher limit on open fds -- but I fear that might just be hiding the
problem rather than addressing it: exhausting a 512 fd limit with
peak load of only 48 req/sec (and average load of 2 req/sec) just
doesn't feel right (especially when that peak load is all 303s
generated internally by httpd, which each take only a tiny fraction of
a second to process).
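
One rough way to put numbers on that intuition is Little's law
(descriptors in use = arrival rate x time each one is held). The
Python sketch below assumes exactly one descriptor per request and
ignores listening sockets, log files and FastCGI sockets, so it is
only a back-of-the-envelope check; the function name is made up.

```python
def required_hold_seconds(fd_limit: int, requests_per_second: float) -> float:
    """Seconds each request must hold its descriptor to fill the limit.

    Rearranged from Little's law: in_use = rate * hold_time.
    Assumes one descriptor per request (a simplification).
    """
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return fd_limit / requests_per_second


average_case = required_hold_seconds(512, 2)   # 256.0 seconds
peak_case = required_hold_seconds(512, 48)     # roughly 10.7 seconds
```

Under that assumption, sustained peak load would need each connection
to linger for about 10.7 seconds, and the average load would need more
than four minutes per connection, to reach 512 descriptors. Neither
fits redirects that complete in a tiny fraction of a second.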

I notice in the source that server_close_http() is responsible for
freeing session-specific fds, and that it's called from server_close(),
which is also responsible for generating the "..., done" debug messages
and decrementing the active client count.

We're only seeing those "..., done" messages in the debug output for a
small proportion of completed HTTP sessions, and the active client count
continues to grow (and only falls occasionally), even when there is much
less HTTP traffic.
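
A small Python sketch for tallying this from a saved debug log. It
assumes only the line shapes quoted above ("client N (M active)",
lines ending in ", done", and "server_accept: deferring connections").
The function name and the returned keys are hypothetical.

```python
import re

ACTIVE_RE = re.compile(r"client (\d+) \((\d+) active\)")
DEFER_MSG = "server_accept: deferring connections"


def summarise(debug_lines):
    """Count deferrals and 'done' messages and track the active count."""
    active_counts = []
    deferrals = 0
    done = 0
    for raw in debug_lines:
        line = raw.strip()
        if line.startswith(DEFER_MSG):
            deferrals += 1
        match = ACTIVE_RE.search(line)
        if match:
            active_counts.append(int(match.group(2)))
        # Works whether or not the entry was wrapped onto a second line.
        if line.endswith(", done"):
            done += 1
    return {
        "deferrals": deferrals,
        "done": done,
        "max_active": max(active_counts) if active_counts else None,
        "last_active": active_counts[-1] if active_counts else None,
    }
```

Comparing "done" against the number of requests in the access log for
the same period, and watching "last_active" over successive runs,
gives a quick view of whether sessions are being closed as fast as
they are opened.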

It seems as if some HTTP sessions get their fds freed on completion
while others don't ... but I can't find anything in the source to
support that conjecture.

Could someone who's more familiar with httpd than I am offer a clue
please?



Re: httpd stops accepting connections after a few hours on current

2015-07-11 Thread Jack Burton
On Thu, 2015-07-09 at 11:59 +0200, Tor Houghton wrote: 
 On Wed, Jul 08, 2015 at 10:04:27PM -0500, Theodore Wynnychenko wrote:
  
  [snip]
 
  server https://server2.tldn.com, client 2067 (63 active), 10.0.28.254:60330 -
  10.0.28.130:443, buffer event error
  [..]
  server https://server2.tldn.com, client 2068 (63 active), 10.0.28.254:52350 -
  10.0.28.130:443, buffer event error
 
 I'm going to me too on this one (have not been until now, as I thought
 perhaps it was due to my setup, and therefore off-topic).

Likewise, seeing the same behaviour here on 5.7-stable -- so the
problem is not confined to -current.

Fairly small & simple httpd setup here, httpd configured with 3 server
stanzas: 2 HTTPS-only (both using FastCGI) plus one trivial HTTP-only
(just a block return 303 pointing to one of the HTTPS servers). Quite a
light load too (averaging 178k requests/day -- about 2/sec).

Frequency of problem varies wildly -- sometimes occurs after only an
hour or two since last httpd restart and at other times httpd will last
for up to 4 days before it stops responding to requests. Variation in
volume of requests appears to have no effect on frequency of recurrence
either.

On every occasion, httpd continues to respond correctly to signals
(httpd restarts are always clean), just not to HTTP[S] requests.

On at least one occasion, the http socket continued to respond correctly
to requests, whilst the two https ones stopped responding. On other
occasions, all 3 stopped responding at around the same time.

When a socket stops responding, it still accepts requests but httpd
neither logs (at least, when not in debug mode) nor responds to them
(i.e. I can successfully open a TCP session to the listening socket and
send it a request, but nothing comes back after the initial ACK).

It hasn't happened here in a few days now so I don't have a log extract
on hand to share (but can post one next time it happens).

From memory in the past we were seeing TLS accept fail errors in the
logs, as reported by the original poster, but not at the time the
sockets stopped responding (only well beforehand), so I'd also assumed
that those were unrelated. Running tcpdump on both user-facing
interfaces (and on pflog0 just to rule out the possibility of some
error in our pf.conf) whilst httpd was not responding to requests on
previous occasions revealed nothing new.

Have tried watching debug output a couple of times before, but it
rapidly gets quite unwieldy, even with our modest load (especially over
a remote ssh session -- both uplinks at that site are nearing
capacity), given the length of time it can take for the problem to
manifest (on each occasion I gave up after a few hours without the
problem occurring).

Am now running httpd -dvvv with stdout/err redirected to a temporary log
file (probably should have done that in the first place).

We are already seeing (after less than a minute) entries in the debug
logs similar to those reported by Theodore, for example:

* On an HTTPS server (using FastCGI):
server portal, client 305 (14 active), 192.168.137.161:52224 -
192.168.137.1:443, buffer event error

and

* On the trivial HTTP server (using just a block return 303):
server redir, client 132 (11 active), 192.168.137.100:61081 -
192.168.137.1, buffer event timeout

However, the original problem (httpd stops responding to requests) is
*not* occurring at present.

Will post debug log extract & httpd.conf next time the problem recurs
(should be within the next few days).