Re: Large difference between response time as measured by nginx and as measured by HAProxy

2020-08-27 Thread Willy Tarreau
On Thu, Aug 27, 2020 at 05:39:34PM +0200, Tim Düsterhus, WoltLab GmbH wrote:
> Willy and I discussed this in private. My understanding is that this
> issue is caused by an inherent limitation of H2 with regard to
> head-of-line blocking.
> 
> In simple terms:
> 
> If the data at the beginning of the backend connection belongs to a
> client that is reading its own client connection very slowly, or not at
> all, then everything within that backend connection will stall until the
> slow client finally gets around to reading the data that's intended for it.
> 
> Basically, a very slow client can block *other* clients that are
> unfortunate enough to share the same backend connection.

And in order to complete this, one theoretical workaround would consist
of using very small initial windows, but that would not be practical.
A test with just the buffer size didn't help.

> Our resolution is simple: We disable H2 for the backend connections and
> use good old H1. We don't *need* end-to-end H2.

Also, that raised an idea about adding a new "http-reuse" mode, or slightly
changing the way "never" works for H2, which would consist of only grouping
together requests from the same session (client connection). It would be
what "never" already does, except that after the last response, instead of
closing the connection, it would be sent back to the idle pool for any
other client to use. It would not be perfect, but it would keep a given
client's streams together so that they progress at that client's own pace.
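
For reference, the mode this idea would build on already exists today as a
value of the "http-reuse" keyword; a minimal sketch (backend name, server
name and address are placeholders, not taken from this thread):

  backend nginx
    # one client session sticks to its own backend connection
    http-reuse never
    server nginx1 127.0.0.1:8080 proto h2

The proposed variant would behave like "never" for the lifetime of the client
session, but hand the connection back to the idle pool afterwards instead of
closing it.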

Willy



Re: Large difference between response time as measured by nginx and as measured by HAProxy

2020-08-27 Thread Tim Düsterhus , WoltLab GmbH
Hi List,

On 27.08.20 at 11:45, Tim Düsterhus, WoltLab GmbH wrote:
>> You can see two h2c connections, one is on the frontend (hence the 'F')
>> and the other one is on the backend. The h2s is the H2 stream, unique to
>> a request (though it comes from a pool, so it will be reused for new
>> streams on any connection once this one is finished). Thus by just
>> filtering on the h2c and correlating with what you see in the logs (with
>> your source ports), and the date of course, you can quite reliably match
>> connections, hence requests.
> 
> Okay, that works sufficiently well for me. I've extracted the relevant
> parts, redacted any private information and will check with a coworker
> to see if I missed anything. I'll send you the traces + logs in private
> after the contents have been ACKed :-)
> 

Willy and I discussed this in private. My understanding is that this
issue is caused by an inherent limitation of H2 with regard to
head-of-line blocking.

In simple terms:

If the data at the beginning of the backend connection belongs to a
client that is reading its own client connection very slowly, or not at
all, then everything within that backend connection will stall until the
slow client finally gets around to reading the data that's intended for it.

Basically, a very slow client can block *other* clients that are
unfortunate enough to share the same backend connection.

Our resolution is simple: We disable H2 for the backend connections and
use good old H1. We don't *need* end-to-end H2.
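
For anyone wanting to do the same, this is a one-line change; a minimal
sketch, assuming the backend server line previously selected H2 with the
"proto" keyword (backend name, server name and address are placeholders):

  backend nginx
    # before: server nginx1 127.0.0.1:8080 proto h2
    server nginx1 127.0.0.1:8080    # falls back to HTTP/1.1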

Best regards
Tim Düsterhus
Developer WoltLab GmbH

-- 

WoltLab GmbH
Nedlitzer Str. 27B
14469 Potsdam

Tel.: +49 331 96784338

duester...@woltlab.com
www.woltlab.com

Managing director:
Marcel Werk

AG Potsdam HRB 26795 P



Re: Large difference between response time as measured by nginx and as measured by HAProxy

2020-08-27 Thread Willy Tarreau
On Thu, Aug 27, 2020 at 11:45:13AM +0200, Tim Düsterhus, WoltLab GmbH wrote:
> Willy,
> 
On 27.08.20 at 10:29, Willy Tarreau wrote:
> >> I now have the trace results and my HAProxy log where I can correlate
> >> the slow requests using the timestamp and path. Unfortunately the trace
> >> does not appear to contain the unique-id of the request.
> > 
> > Sadly, due to the way HPACK works we can't decode the headers being sent.
> There's already an ugly hack being done for the URI and method. I seem to
> remember we pick them from the HTX's start line, but it could not reliably
> > be done for all headers. I'd still like to improve this :-/
> 
> The unique-id is a property of the stream, i.e. (struct stream).unique_id.
> So that one should technically be available, no? I would also be happy
> with using the internal request_counter (the %rt log value), because
> that one is part of my unique ID format anyway.

In the stream output yes, but not in the h1 or h2 output, which do not
necessarily have a stream above them (e.g.: checks). However if we had it
in the stream output, enabling traces at both levels would be enough to
trace them. What I'm missing in both is the conn_stream, which joins the
h1/h2 stream and the application layer stream, but that shouldn't be hard
to add. For the request counter, same problem. I'm trying to think of an
identifier that would be present everywhere but can't think of any. Maybe
we could try to attack the issue the other way around and enable the
production of a trace_id when traces are enabled, and this trace_id would
be passed as much as possible between layers and could also be logged. The
problem with muxed connections is that it would have to be generated on a
new request, and for a number of issues that's already too late. Thus
maybe a composite trace ID would help instead, i.e. front connection ID,
front stream ID, back stream ID, back connection ID. Missing some of these
parts would then not be a problem because it would just mean that we didn't
reach that stage.

> > You can see two h2c connections, one is on the frontend (hence the 'F')
> > and the other one is on the backend. The h2s is the H2 stream, unique to
> > a request (though it comes from a pool, so it will be reused for new
> > streams on any connection once this one is finished). Thus by just
> > filtering on the h2c and correlating with what you see in the logs (with
> > your source ports), and the date of course, you can quite reliably match
> > connections, hence requests.
> 
> Okay, that works sufficiently well for me.

Cool! I do imagine that it has been painful, so thanks for doing it!

> I've extracted the relevant
> parts, redacted any private information and will check with a coworker
> to see if I missed anything. I'll send you the traces + logs in private
> after the contents have been ACKed :-)

No problem.

> There's even a PHP request in there with an 800ms difference between
> nginx and HAProxy.

Ah, excellent! We'll have something to study at least. I'm suddenly thinking
about something: there is a known case of head-of-line blocking in H2 when
stream windows are larger than buffers: a slow stream can cause some of its
data to stay stuck in the connection's buffer for some time. It could very
well be what's happening to you: merge two clients onto the same backend
connection, have one read its data slowly, and the other may sometimes
experience hiccups. This should be significantly reduced by lowering the
number of concurrent streams, however. One way to alleviate it is to reduce
tune.h2.initial-window-size to 16kB, but that setting is used on both sides,
so it would also mean very slow POSTs. The other solution is to increase the
default buffer size to 64kB so that the whole stream window of an incoming H2
data stream can be flushed into a client's buffer. It may increase your memory
consumption a bit if you're mostly dealing with slow clients. A test could
consist of setting it to 32kB, which will generally be enough since you have
one buffer in the mux and another one in the channel ;-)
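
To illustrate, such a test would just be a global tunable or two, using the
figures mentioned above (one approach or the other, not both at once):

  global
    # either shrink the per-stream window (also slows down uploads):
    #tune.h2.initial-window-size 16384
    # or enlarge the buffers so a whole stream window fits:
    tune.bufsize 32768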

> > Do not hesitate to let me know if you think of something that can be
> > improved; traces are still quite young and under-exploited, and I'm quite
> > open to studying potential improvements.
> 
> Nothing else for now. Grepping for the pointers worked sufficiently
> well. The bigger issue is identifying the request in question in the
> first place.

Yes I know, that's always the same for me as well. That's the reason
why the unique-id feature was implemented so long ago!

> I was lucky that the file in this case was fairly
> low-trafficked so that I was able to accurately identify it based on the
> timestamp and path.

Usually, with patience, it does work. There's still a lot of room for
improvement, but I often tend to imagine what it would have been like
without traces at all, and that helps me stay tolerant :-)

Regards,
Willy



Re: Large difference between response time as measured by nginx and as measured by HAProxy

2020-08-27 Thread Tim Düsterhus , WoltLab GmbH
Willy,

On 27.08.20 at 10:29, Willy Tarreau wrote:
>> I now have the trace results and my HAProxy log where I can correlate
>> the slow requests using the timestamp and path. Unfortunately the trace
>> does not appear to contain the unique-id of the request.
> 
> Sadly, due to the way HPACK works we can't decode the headers being sent.
> There's already an ugly hack being done for the URI and method. I seem to
> remember we pick them from the HTX's start line, but it could not reliably
> be done for all headers. I'd still like to improve this :-/

The unique-id is a property of the stream, i.e. (struct stream).unique_id.
So that one should technically be available, no? I would also be happy
with using the internal request_counter (the %rt log value), because
that one is part of my unique ID format anyway.
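
(For context, the unique-id setup documented in the HAProxy manual looks like
the following; the exact format used on this installation is not shown in the
thread:

  unique-id-format %{+X}o\ %ci:%cp_%fi:%fp_%Ts_%rt:%pid
  unique-id-header X-Unique-ID

The %rt field in there is the per-process request counter mentioned above.)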

>> Can I somehow filter down the trace file to just the offending requests
>> + possibly the requests within the same H2 connection? For privacy
>> reasons I would not like to provide the full trace log, even if it's in
>> a non-public email.
> 
> Of course I understand. You can rely on the "h2c" pointer that appears
> in the trace, it's the H2 connection. For example below, I'm forwarding
> an H2 request from a frontend to a backend:
> 
>   <0>2020-08-27T10:23:26.117936+02:00 [01|h2|0|mux_h2.c:2541] rcvd H2 request : h2c=0x7f59f0026880(F,FRH) : [1] H2 REQ: GET http://127.0.0.1:4446/ HTTP/2.0
>   <0>2020-08-27T10:23:26.128922+02:00 [01|h2|0|mux_h2.c:5152] sent H2 request : h2c=0x7f59f00382a0(B,FRH) h2s=0x7f59f0037fa0(1,IDL) : [1] H2 REQ: GET http://mail.google.com/ HTTP/2.0
>   <0>2020-08-27T10:23:26.149012+02:00 [01|h2|0|mux_h2.c:2632] rcvd H2 response : h2c=0x7f59f00382a0(B,FRH) : [1] H2 RES: HTTP/2.0 301
>   <0>2020-08-27T10:23:26.149041+02:00 [01|h2|0|mux_h2.c:4790] sent H2 response : h2c=0x7f59f0026880(F,FRH) h2s=0x7f59f0027dc0(1,HCR) : [1] H2 RES: HTTP/2.0 301
> 
> You can see two h2c connections, one is on the frontend (hence the 'F')
> and the other one is on the backend. The h2s is the H2 stream, unique to
> a request (though it comes from a pool, so it will be reused for new
> streams on any connection once this one is finished). Thus by just
> filtering on the h2c and correlating with what you see in the logs (with
> your source ports), and the date of course, you can quite reliably match
> connections, hence requests.

Okay, that works sufficiently well for me. I've extracted the relevant
parts, redacted any private information and will check with a coworker
to see if I missed anything. I'll send you the traces + logs in private
after the contents have been ACKed :-)

There's even a PHP request in there with an 800ms difference between
nginx and HAProxy.

> Do not hesitate to let me know if you think of something that can be
> improved; traces are still quite young and under-exploited, and I'm quite
> open to studying potential improvements.

Nothing else for now. Grepping for the pointers worked sufficiently
well. The bigger issue is identifying the request in question in the
first place. I was lucky that the file in this case was fairly
low-trafficked so that I was able to accurately identify it based on the
timestamp and path.

Best regards
Tim Düsterhus
Developer WoltLab GmbH

-- 

WoltLab GmbH
Nedlitzer Str. 27B
14469 Potsdam

Tel.: +49 331 96784338

duester...@woltlab.com
www.woltlab.com

Managing director:
Marcel Werk

AG Potsdam HRB 26795 P



Re: Large difference between response time as measured by nginx and as measured by HAProxy

2020-08-27 Thread Willy Tarreau
Hi Tim,

On Thu, Aug 27, 2020 at 10:05:08AM +0200, Tim Düsterhus, WoltLab GmbH wrote:
> My definition of reproducible is "I can write up a list of steps that
> makes the issue happen somewhat reliably".

OK, that's fine by me :-)

> This morning I re-enabled H2 for the backend communication and then
> plugged in the tracing. In the half hour since I re-enabled H2 I'm
> seeing 160 static requests taking longer than 45ms, with the worst ones
> being > 800ms.

So there's definitely something related to H2.

> I now have the trace results and my HAProxy log where I can correlate
> the slow requests using the timestamp and path. Unfortunately the trace
> does not appear to contain the unique-id of the request.

Sadly, due to the way HPACK works we can't decode the headers being sent.
There's already an ugly hack being done for the URI and method. I seem to
remember we pick them from the HTX's start line, but it could not reliably
be done for all headers. I'd still like to improve this :-/

It's also possible to log the streams (at level developer) and get the
full requests/responses, but they will not necessarily correlate with the
H2 traffic (though it could give you high enough confidence).
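
A sketch of what that would look like, reusing the trace command suggested
earlier in this thread and only raising the trace level (socket and log
paths are placeholders):

  (echo "trace h2 sink buf0; trace h2 level developer; trace h2 verbosity \
   complete; trace h2 start now; show events buf0 -w -n" ; cat ) | \
  socat - /path/to/socket > /path/to/logs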

> Can I somehow filter down the trace file to just the offending requests
> + possibly the requests within the same H2 connection? For privacy
> reasons I would not like to provide the full trace log, even if it's in
> a non-public email.

Of course I understand. You can rely on the "h2c" pointer that appears
in the trace, it's the H2 connection. For example below, I'm forwarding
an H2 request from a frontend to a backend:

  <0>2020-08-27T10:23:26.117936+02:00 [01|h2|0|mux_h2.c:2541] rcvd H2 request : h2c=0x7f59f0026880(F,FRH) : [1] H2 REQ: GET http://127.0.0.1:4446/ HTTP/2.0
  <0>2020-08-27T10:23:26.128922+02:00 [01|h2|0|mux_h2.c:5152] sent H2 request : h2c=0x7f59f00382a0(B,FRH) h2s=0x7f59f0037fa0(1,IDL) : [1] H2 REQ: GET http://mail.google.com/ HTTP/2.0
  <0>2020-08-27T10:23:26.149012+02:00 [01|h2|0|mux_h2.c:2632] rcvd H2 response : h2c=0x7f59f00382a0(B,FRH) : [1] H2 RES: HTTP/2.0 301
  <0>2020-08-27T10:23:26.149041+02:00 [01|h2|0|mux_h2.c:4790] sent H2 response : h2c=0x7f59f0026880(F,FRH) h2s=0x7f59f0027dc0(1,HCR) : [1] H2 RES: HTTP/2.0 301

You can see two h2c connections, one is on the frontend (hence the 'F')
and the other one is on the backend. The h2s is the H2 stream, unique to
a request (though it comes from a pool, so it will be reused for new
streams on any connection once this one is finished). Thus by just
filtering on the h2c and correlating with what you see in the logs (with
your source ports), and the date of course, you can quite reliably match
connections, hence requests.
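
Concretely, once a suspect connection has been identified, something as
simple as the following pulls out everything belonging to it (the pointer
value is just the backend one from the example above, and the log path is a
placeholder):

  grep 'h2c=0x7f59f00382a0' /path/to/logs

The same can be done with the h2s pointer to follow one particular stream.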

Do not hesitate to let me know if you think of something that can be
improved; traces are still quite young and under-exploited, and I'm quite
open to studying potential improvements.

Willy



Re: Large difference between response time as measured by nginx and as measured by HAProxy

2020-08-27 Thread Tim Düsterhus , WoltLab GmbH
Willy,

On 26.08.20 at 17:51, Willy Tarreau wrote:
>> As I said: It's not really reproducible.
> 
> Yeah but we all have a definition of "not really reproducible". As you
> consider that it doesn't happen in HTTP/1 after only 4 hours, for me
> this means you expect it to usually happen at least once in 4 hours.
> If the load is not too high (no more than a few thousand requests per
> second) and you have some disk space, the h2 trace method could prove
> to be useful once you have isolated a culprit in the logs. You could
> even gzip the output on the fly to take less space, they compress very
> well.

My definition of reproducible is "I can write up a list of steps that
makes the issue happen somewhat reliably". This is not the case here. If
I test with e.g. nghttp -a to pull the HTML + all resources then
everything is working smoothly. Same if I attempt to pull down the same
static file after I am seeing an issue within the logs.
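
(For the record, that kind of test is simply something along the lines of the
following, with the hostname being a placeholder:

  nghttp -a https://www.example.com/

i.e. fetching the page plus the assets linked from it over H2.)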

An update regarding the H1 numbers: In the 20 hours or so with HTTP/1
enabled, a total of 15 (!) static requests took longer than 45ms, the
maximum being 77ms. That is still more than I would like, but nothing
compared to the H2 performance.

This morning I re-enabled H2 for the backend communication and then
plugged in the tracing. In the half hour since I re-enabled H2 I'm
seeing 160 static requests taking longer than 45ms, with the worst ones
being > 800ms.

I now have the trace results and my HAProxy log where I can correlate
the slow requests using the timestamp and path. Unfortunately the trace
does not appear to contain the unique-id of the request.

Can I somehow filter down the trace file to just the offending requests
+ possibly the requests within the same H2 connection? For privacy
reasons I would not like to provide the full trace log, even if it's in
a non-public email.

>>> Another thing you can try is to artificially limit
>>> tune.h2.max-concurrent-streams just in case there is contention in
>>> the server's connection buffers. By default it's 100, you can try with
>>> much less (e.g. 20) and see if it seems to make any difference at all.
>>>
>>
>> The fact that disabling HTTP/2 helps could indicate that something like
>> this is the case here. I'll try that tomorrow, thanks.
> 

I've not done this yet; I'd first like to hear how we get on with the
trace I've collected.

Best regards
Tim Düsterhus
Developer WoltLab GmbH

-- 

WoltLab GmbH
Nedlitzer Str. 27B
14469 Potsdam

Tel.: +49 331 96784338

duester...@woltlab.com
www.woltlab.com

Managing director:
Marcel Werk

AG Potsdam HRB 26795 P



Re: Large difference between response time as measured by nginx and as measured by HAProxy

2020-08-26 Thread Willy Tarreau
On Wed, Aug 26, 2020 at 05:38:52PM +0200, Tim Düsterhus, WoltLab GmbH wrote:
> There are even worse ones:
> 
> nginx:
> 
> > Aug 26 09:56:50 *snip* abd21141cdb4[5615]: 5F4631E22E407 "GET *snip*.gif HTTP/2.0" 200 139 0.000
> 
> HAProxy:
> 
> > Aug 26 09:57:17 *snip* haproxy[3066]: *snip*:57086 [26/Aug/2020:09:56:50.265] https~ *snip*/nginx 0/0/0/27340/27340 200 335 - -  82/82/17/17/0 0/0 {*snip*|*snip*} {*snip*|} "GET *snip*.gif HTTP/2.0" 5F4631E22E407
> 
> Yes, that's 27340ms. I don't have any more details regarding the
> connection reuse for that, unfortunately :-(

Ouch!

> FWIW: I disabled HTTP/2 for the backend connection after sending my
> initial email and I am not seeing anything worse than 45ms for static
> files any more.

OK, that already narrows the issue down a bit.

> I just arbitrarily checked 3 PHP requests and they match. The static
> ones *mostly* do as well. It's just that static files taking ages stick
> out like a sore thumb, while PHP tends to be all over the place anyway.
> It basically was a case of grepping for static file requests, running
> "halog -pct", and wondering why there's a large number at the bottom.

I can understand indeed.

> See above: HTTP/1 did not show the issue within the last 4 hours, so I
> would rule out nginx / the disk being the culprit here.

That's perfect, it definitely rules out that part indeed.

> Static files are served from fast local SSDs with plenty of page cache.
> I don't expect disk access to be a bottleneck anywhere near the 3-digit
> millisecond range.

OK. I've been hit in the past by machines swapping out while uploading
some logs over HTTP, and taking ages to even allocate network buffers
after that! They used to slowly recover during the day and start their
madness again the next night. It was even causing packet losses and
retransmissions! That's why I asked :-)

> As I said: It's not really reproducible.

Yeah but we all have a definition of "not really reproducible". As you
consider that it doesn't happen in HTTP/1 after only 4 hours, for me
this means you expect it to usually happen at least once in 4 hours.
If the load is not too high (no more than a few thousand requests per
second) and you have some disk space, the h2 trace method could prove
to be useful once you have isolated a culprit in the logs. You could
even gzip the output on the fly to take less space, they compress very
well.
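
In practice that just means appending gzip to the pipeline suggested earlier,
e.g. (socket and output paths are placeholders):

  (echo "trace h2 sink buf0; trace h2 level user; trace h2 verbosity \
   complete; trace h2 start now; show events buf0 -w -n" ; cat ) | \
  socat - /path/to/socket | gzip > /path/to/logs.gz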

> I'm mostly seeing it after the
> fact. I can't even tell whether HAProxy's or nginx's measurement is
> correct (i.e. which latency the client actually experienced).

Unless there's a bug causing haproxy to report a random value (or, for
example, a timer related to the previous request, which would explain the
large values), I'd be tempted to say that the client likely experiences
it :-/

> However tcpdump definitely is not going to be fun, because it's all TLS
> with PFS ciphers.

That's what I was suspecting, sadly, but preferred to ask.

> > Another thing you can try is to artificially limit
> > tune.h2.max-concurrent-streams just in case there is contention in
> > the server's connection buffers. By default it's 100, you can try with
> > much less (e.g. 20) and see if it seems to make any difference at all.
> > 
> 
> The fact that disabling HTTP/2 helps could indicate that something like
> this is the case here. I'll try that tomorrow, thanks.

You're welcome!

Willy



Re: Large difference between response time as measured by nginx and as measured by HAProxy

2020-08-26 Thread Tim Düsterhus , WoltLab GmbH
Willy,

On 26.08.20 at 16:59, Willy Tarreau wrote:
> On Wed, Aug 26, 2020 at 12:59:55PM +0200, Tim Düsterhus, WoltLab GmbH wrote:
>> Within the HAProxy logs we're sometimes seeing largish 'Tr' values for
>> static file requests. For example 734ms for the example request below.
> 
> That sounds huge. I could have imagined a 200ms time, possibly indicating
> that somewhere between the two, an extra MSG_MORE was present and prevented
> a request and/or response from leaving the machine. But 734ms doesn't
> correspond to anything usual.

There are even worse ones:

nginx:

> Aug 26 09:56:50 *snip* abd21141cdb4[5615]: 5F4631E22E407 "GET *snip*.gif HTTP/2.0" 200 139 0.000

HAProxy:

> Aug 26 09:57:17 *snip* haproxy[3066]: *snip*:57086 [26/Aug/2020:09:56:50.265] https~ *snip*/nginx 0/0/0/27340/27340 200 335 - -  82/82/17/17/0 0/0 {*snip*|*snip*} {*snip*|} "GET *snip*.gif HTTP/2.0" 5F4631E22E407

Yes, that's 27340ms. I don't have any more details regarding the
connection reuse for that, unfortunately :-(

FWIW: I disabled HTTP/2 for the backend connection after sending my
initial email and I am not seeing anything worse than 45ms for static
files any more.

>> The two requests in nginx' log:
>>
>>> Aug 26 10:23:21 *snip* abd21141cdb4[5615]: 5F46381900A0 "GET *snip* HTTP/2.0" 200 27049 0.070
>>> Aug 26 10:23:26 *snip* abd21141cdb4[5615]: 5F46381E00F5 "GET *snip*.gif HTTP/2.0" 200 6341 0.000
>>
>> Note how the timing for the PHP request matches up with what HAProxy was
>> seeing: Roughly 70ms / 0.070s. The timing for the GIF request is *way*
>> off, though.
> 
> Stupid question: do the PHP requests *always* match? I'm asking because

I just arbitrarily checked 3 PHP requests and they match. The static
ones *mostly* do as well. It's just that static files taking ages stick
out like a sore thumb, while PHP tends to be all over the place anyway.

It basically was a case of grepping for static file requests, running
"halog -pct", and wondering why there's a large number at the bottom.

> one point is that haproxy cannot tell the difference between the two.
> However, the static one very likely responds instantly, which could explain
> a less common code path being taken and a bug being triggered.
> 
> However static files might also cause some disk access, and not knowing
> anything about the environment, I'd say it's also possible that the whole
> nginx process gets frozen for the time it takes open() to respond, and if
> for whatever reason there are lots of I/Os and it takes that long, it might
> be possible that it doesn't notice it, depending on where the total time is
> retrieved.

See above: HTTP/1 did not show the issue within the last 4 hours, so I
would rule out nginx / the disk being the culprit here.

Static files are served from fast local SSDs with plenty of page cache.
I don't expect disk access to be a bottleneck anywhere near the 3-digit
millisecond range.

> But it could definitely be a bug in the connection reuse or mux code, of
> course. It's just that I have no idea what could cause this for now.
> 
> If you see it often enough, it might be useful to enable h2 traces and
> watch them on the CLI, or even run a tcpdump between the two to observe
> what's going on. The traces might be easier as a first step as you'll
> see already decoded stuff. For example:
> 
>   (echo "trace h2 sink buf0; trace h2 level user; trace h2 verbosity \
>complete; trace h2 start now;show events buf0 -w -n" ; cat ) | \
>   socat - /path/to/socket > /path/to/logs
> 
> Once you're confident you got it, you can disable all traces using the
> short form "trace 0".

As I said: It's not really reproducible. I'm mostly seeing it after the
fact. I can't even tell whether HAProxy's or nginx's measurement is
correct (i.e. which latency the client actually experienced).

However tcpdump definitely is not going to be fun, because it's all TLS
with PFS ciphers.

> Another thing you can try is to artificially limit
> tune.h2.max-concurrent-streams just in case there is contention in
> the server's connection buffers. By default it's 100, you can try with
> much less (e.g. 20) and see if it seems to make any difference at all.
> 

The fact that disabling HTTP/2 helps could indicate that something like
this is the case here. I'll try that tomorrow, thanks.

Best regards
Tim Düsterhus
Developer WoltLab GmbH

-- 

WoltLab GmbH
Nedlitzer Str. 27B
14469 Potsdam

Tel.: +49 331 96784338

duester...@woltlab.com
www.woltlab.com

Managing director:
Marcel Werk

AG Potsdam HRB 26795 P



Re: Large difference between response time as measured by nginx and as measured by HAProxy

2020-08-26 Thread Willy Tarreau
Hi Tim,

On Wed, Aug 26, 2020 at 12:59:55PM +0200, Tim Düsterhus, WoltLab GmbH wrote:
> Within the HAProxy logs we're sometimes seeing largish 'Tr' values for
> static file requests. For example 734ms for the example request below.

That sounds huge. I could have imagined a 200ms time, possibly indicating
that somewhere between the two, an extra MSG_MORE was present and prevented
a request and/or response from leaving the machine. But 734ms doesn't
correspond to anything usual.

> The two requests in nginx' log:
> 
> > Aug 26 10:23:21 *snip* abd21141cdb4[5615]: 5F46381900A0 "GET *snip* HTTP/2.0" 200 27049 0.070
> > Aug 26 10:23:26 *snip* abd21141cdb4[5615]: 5F46381E00F5 "GET *snip*.gif HTTP/2.0" 200 6341 0.000
> 
> Note how the timing for the PHP request matches up with what HAProxy was
> seeing: Roughly 70ms / 0.070s. The timing for the GIF request is *way*
> off, though.

Stupid question: do the PHP requests *always* match? I'm asking because
one point is that haproxy cannot tell the difference between the two.
However, the static one very likely responds instantly, which could explain
a less common code path being taken and a bug being triggered.

However static files might also cause some disk access, and not knowing
anything about the environment, I'd say it's also possible that the whole
nginx process gets frozen for the time it takes open() to respond, and if
for whatever reason there are lots of I/Os and it takes that long, it might
be possible that it doesn't notice it, depending on where the total time is
retrieved.

But it could definitely be a bug in the connection reuse or mux code, of
course. It's just that I have no idea what could cause this for now.

If you see it often enough, it might be useful to enable h2 traces and
watch them on the CLI, or even run a tcpdump between the two to observe
what's going on. The traces might be easier as a first step as you'll
see already decoded stuff. For example:

  (echo "trace h2 sink buf0; trace h2 level user; trace h2 verbosity \
   complete; trace h2 start now;show events buf0 -w -n" ; cat ) | \
  socat - /path/to/socket > /path/to/logs

Once you're confident you got it, you can disable all traces using the
short form "trace 0".

Another thing you can try is to artificially limit
tune.h2.max-concurrent-streams just in case there is contention in
the server's connection buffers. By default it's 100, you can try with
much less (e.g. 20) and see if it seems to make any difference at all.
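
That is a single line in the global section, for example with the arbitrary
value mentioned above:

  global
    tune.h2.max-concurrent-streams 20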

Hoping this helps,
Willy



Re: Large difference between response time as measured by nginx and as measured by HAProxy

2020-08-26 Thread Tim Düsterhus , WoltLab GmbH
Ilya,

On 26.08.20 at 13:13, Илья Шипицин wrote:
> nginx buffers output by default.
> 
> can you try
> 
> proxy_request_buffering off;
> proxy_buffering off;
> 

The GIF file with the 734ms difference in response time between the two
logs is a static file served from the local file system.

The proxy_* options do not apply here. In fact nginx is not performing
any proxying at all, it's just doing FastCGI. But FastCGI is not the
thing I am concerned about.

Best regards
Tim Düsterhus
Developer WoltLab GmbH

-- 

WoltLab GmbH
Nedlitzer Str. 27B
14469 Potsdam

Tel.: +49 331 96784338

duester...@woltlab.com
www.woltlab.com

Managing director:
Marcel Werk

AG Potsdam HRB 26795 P



Re: Large difference between response time as measured by nginx and as measured by HAProxy

2020-08-26 Thread Илья Шипицин
nginx buffers output by default.

can you try

proxy_request_buffering off;
proxy_buffering off;

?

Wed, 26 Aug 2020 at 16:02, Tim Düsterhus, WoltLab GmbH <duester...@woltlab.com>:

> Hi List,
>
> please keep this email in Cc for all responses, I am not subscribed to
> the list.
>
> We are running HAProxy 2.1.8 in front of nginx serving static files and
> acting as a FastCGI gateway. HAProxy is communicating with nginx using
> HTTP/2. Clients are regular web browsers, mostly using HTTP/2.
>
> Within the HAProxy logs we're sometimes seeing largish 'Tr' values for
> static file requests. For example 734ms for the example request below.
> The corresponding request in the nginx log shows a $request_time of
> 0.000s
> (http://nginx.org/en/docs/http/ngx_http_log_module.html#var_request_time).
>
> HAProxy log line for a static GIF file:
>
> > Aug 26 10:23:27 *snip* haproxy[3066]: *snip*:1167 [26/Aug/2020:10:23:26.028] https~ *snip*/nginx 0/0/0/734/1719 200 6540 - -  19/19/12/12/0 0/0 {*snip*|*snip*} {*snip*|} "GET *snip*.gif HTTP/2.0" *snip*:59348 5F46381E00F5
>
> Some notes regarding the log format: This is the standard HTTP log
> format with '%bi:%bp %ID' appended.
>
> According to the '%bp' value this was not the first request within the
> backend connection. The backend connection was established 5 seconds
> earlier for an HTTP/1.1 request going to PHP. According to the request ID
> this was the 160th (00A0) request after a reload (I just recently added
> the logging of %bp).
>
> > Aug 26 10:23:21 *snip* haproxy[3066]: *snip*:11594 [26/Aug/2020:10:23:21.148] https~ *snip*/nginx 0/0/2/69/72 200 27432 - -  19/19/0/0/0 0/0 {*snip*|*snip*} {*snip*|} "GET *snip* HTTP/1.1" *snip*:59348 5F46381900A0
>
> The two requests in nginx' log:
>
> > Aug 26 10:23:21 *snip* abd21141cdb4[5615]: 5F46381900A0 "GET *snip* HTTP/2.0" 200 27049 0.070
> > Aug 26 10:23:26 *snip* abd21141cdb4[5615]: 5F46381E00F5 "GET *snip*.gif HTTP/2.0" 200 6341 0.000
>
> Note how the timing for the PHP request matches up with what HAProxy was
> seeing: Roughly 70ms / 0.070s. The timing for the GIF request is *way*
> off, though.
>
> Any idea what is happening there? It's not really reproducible in any
> way. I'll be happy to add more stuff to my logs if that would help. I
> can also provide specific configuration settings on request.
>
> Best regards
> Tim Düsterhus
> Developer WoltLab GmbH
>
> --
>
> WoltLab GmbH
> Nedlitzer Str. 27B
> 14469 Potsdam
>
> Tel.: +49 331 96784338
>
> duester...@woltlab.com
> www.woltlab.com
>
> Managing director:
> Marcel Werk
>
> AG Potsdam HRB 26795 P
>
>