Re: [akka-user][deprecated] Re: [akka-user] Spray->Akka-Http Migration - seeing high 99th percentile latencies post-migration

2018-10-09 Thread Johannes Rudolph
That the entity directive is part of the picture could be a hint that
streaming requests are indeed the cause of this. In spray, request
streaming was not enabled by default: the engine collected the complete
request into a buffer and dispatched it to the app only after everything
had been received. This has changed in akka-http, where streaming kicks in
by default whenever the complete request isn't received from the network
in one go. The streaming case is therefore actually more likely to occur
on low-traffic servers with a real network, where packets are not
aggregated at lower levels but are processed immediately as they arrive.
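
If buffering turns out to be the issue, one way to approximate spray's old
behaviour is to force the entity into memory before the inner route runs. A
minimal sketch, assuming Akka HTTP 10.0.x and that the `toStrictEntity`
directive is available in your version; the route and timeout are only
illustrative:

    // Sketch: collect the whole request body before the inner route runs,
    // roughly reproducing spray's collect-then-dispatch behaviour.
    import scala.concurrent.duration._
    import akka.http.scaladsl.server.Directives._

    val bufferedRoute =
      path("foos") {
        toStrictEntity(2.seconds) {      // fails the request if the body hasn't arrived within 2s
          entity(as[String]) { body =>
            complete(s"received ${body.length} bytes")
          }
        }
      }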

The question is still whether the 200ms is really added latency in akka-http
or just an artifact of how request processing time is measured. There's
definitely *some* overhead to processing a request in streaming fashion, but
it's not 200ms. I haven't checked in depth, but Kamon might be measuring
something other than what you think it is in akka-http: it seems to start
measuring when the request is dispatched to your app, at which point the
request body might not have been fully received. That means that whenever
the HTTP client is slow to send a request, for whatever reason, it will show
up in your request processing times.
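
One way to check that hypothesis, independent of Kamon, is to time the entity
consumption itself inside the route. A rough sketch (the names below are
placeholders, not the code from this thread); if most of the 200ms shows up
between t0 and t1, it is time spent waiting for the body, not time spent in
application code:

    // Sketch: separate "waiting for the request body" from "processing it".
    import scala.concurrent.duration._
    import akka.http.scaladsl.server.Directives._

    val timedRoute =
      path("foos") {
        extractMaterializer { implicit mat =>
          extractRequest { request =>
            val t0 = System.nanoTime()
            onSuccess(request.entity.toStrict(2.seconds)) { strict =>
              val t1 = System.nanoTime()
              val receiveMillis = (t1 - t0) / 1000000
              // log receiveMillis as "body receive time"; time the actual work separately
              complete(s"body (${strict.data.length} bytes) took $receiveMillis ms to arrive")
            }
          }
        }
      }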

Johannes

On Mon, Oct 8, 2018 at 10:42 PM Gary Malouf  wrote:

> We ultimately decided to roll out despite this glitch.  Not happy about it,
> and hoping whatever is causing this gets resolved in a future release.  My
> hunch is that it's a fixed cost that would become unnoticeable if thousands
> more requests/second were sent to the app.
>
>
>
> On Sun, Oct 7, 2018 at 11:18 AM Avshalom Manevich 
> wrote:
>
>> Hi Gary,
>>
>> Did you end up finding a solution to this?
>>
>> We're hitting a similar issue with Akka HTTP (10.0.11) and a low-load
>> server.
>>
>> Average latency is great but 99th percentile is horrible (~200ms).
>>
>> Appreciate your input.
>>
>> Regards,
>> Avshalom
>>
>>
>> I wonder if you could start a timer when you enter the trace block and
>>> then e.g. after 200ms trigger one or multiple stack dumps (using JMX or
>>> just by printing out the result of `Thread.getAllStackTraces`). It's not
>>> super likely that something will turn up but it seems like a simple enough
>>> thing to try.
>>>
>>> Johannes
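
A rough sketch of that stack-dump idea, assuming an ActorSystem scheduler is
at hand; the 200ms threshold and the helper names are placeholders:

    // Sketch: schedule a full stack dump shortly after entering the traced
    // block and cancel it if the block finishes in time.
    import scala.collection.JavaConverters._
    import scala.concurrent.duration._
    import akka.actor.ActorSystem

    def dumpAllStacks(): Unit =
      Thread.getAllStackTraces.asScala.foreach { case (thread, frames) =>
        println(s"--- ${thread.getName} (${thread.getState})")
        frames.foreach(frame => println(s"    at $frame"))
      }

    def withStackDumpAfter[T](delay: FiniteDuration)(block: => T)(implicit system: ActorSystem): T = {
      import system.dispatcher
      val pending = system.scheduler.scheduleOnce(delay)(dumpAllStacks())
      try block finally pending.cancel()
    }

    // usage inside the traced block:
    //   withStackDumpAfter(200.millis) { processPageRequestAndComplete(pr, spr) }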
>>>
>>> On Thursday, November 16, 2017 at 1:28:23 PM UTC+1, Gary Malouf wrote:
>>>
 Hi Johannes,

 Yes; we are seeing 2-3 requests/second (only in production) with the
 latency spikes.  We found no correlation between GC times and these
 request latencies, nor between the size or type of the requests.

 We had to pause the migration effort for two weeks because of the time
 it was taking, but just jumped back on it the other day.

 Our current strategy is to implement this with the low-level API to see
 if we get the same results.

 Gary
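
For reference, a minimal sketch of what the low-level API variant mentioned
above could look like in Akka HTTP 10.0.x; the system name, port and timeout
are placeholders:

    // Sketch: low-level server API, bypassing the routing DSL entirely and
    // collecting the entity explicitly so nothing is left unconsumed.
    import scala.concurrent.Future
    import scala.concurrent.duration._
    import akka.actor.ActorSystem
    import akka.http.scaladsl.Http
    import akka.http.scaladsl.model._
    import akka.stream.ActorMaterializer

    implicit val system = ActorSystem("low-level-test")
    implicit val materializer = ActorMaterializer()
    import system.dispatcher

    val handler: HttpRequest => Future[HttpResponse] = { request =>
      request.entity.toStrict(2.seconds).map { strict =>
        HttpResponse(entity = HttpEntity(s"got ${strict.data.length} bytes"))
      }
    }

    Http().bindAndHandleAsync(handler, "0.0.0.0", 8080)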

 On Nov 16, 2017 6:57 AM,  wrote:

 Hi Gary,

 Did you find out what's going on by now? If I understand correctly, you
 get latency spikes as soon as you use the `entity(as[String])` directive?
 Could you narrow down whether there's anything special about those requests?
 I assume you are monitoring your GC times?

 Johannes


 On Wednesday, November 1, 2017 at 8:56:50 PM UTC+1, Gary Malouf wrote:

> So the only way I was able to successfully identify the suspicious
> code was to route a percentage of my production traffic to a stubbed route
> that I incrementally added pieces of our implementation back into.  What I
> found was that we started getting spikes when the
> entity(as[CaseClassFromJson]) stub was added back in.  To figure out
> whether it was the JSON parsing or the 'POST' entity consumption itself, I
> replaced that case class with a plain String - it turns out we experience
> the latency spikes with that as well (on low traffic, as noted earlier in
> this thread).
>
> I by no means have a deep understanding of streams, but it makes me
> wonder if the way our code consumes the entity is incorrect.
>
> On Monday, October 30, 2017 at 4:27:13 PM UTC-4, Gary Malouf wrote:
>
>> Hi Roland - thank you for the tip.  We shrunk the thread pool size
>> down to 1, but were disheartened to still see the latency spikes.  Using
>> Kamon's tracing library (which we validated with various tests to ensure
>> its own numbers are most likely correct), we could not find anything in
>> our code within the route that was causing the latency (it was all
>> attributed to the route itself, but to no code segment within it).
>>
>> As mentioned earlier, running loads of 100-1000 requests/second
>> completely hides the issue (save for the max latency), as everything
>> through the 99th percentile is under a few milliseconds.
>>
>> On Tuesday, October 

Re: [akka-user][deprecated] Re: [akka-user] Spray->Akka-Http Migration - seeing high 99th percentile latencies post-migration

2018-10-08 Thread Gary Malouf
We ultimately decided to roll out despite this glitch.  Not happy about it,
and hoping whatever is causing this gets resolved in a future release.  My
hunch is that it's a fixed cost that would become unnoticeable if thousands
more requests/second were sent to the app.



On Sun, Oct 7, 2018 at 11:18 AM Avshalom Manevich 
wrote:

> Hi Gary,
>
> Did you end up finding a solution to this?
>
> We're hitting a similar issue with Akka HTTP (10.0.11) and a low-load
> server.
>
> Average latency is great but 99th percentile is horrible (~200ms).
>
> Appreciate your input.
>
> Regards,
> Avshalom
>
>
> I wonder if you could start a timer when you enter the trace block and
>> then e.g. after 200ms trigger one or multiple stack dumps (using JMX or
>> just by printing out the result of `Thread.getAllStackTraces`). It's not
>> super likely that something will turn up but it seems like a simple enough
>> thing to try.
>>
>> Johannes
>>
>> On Thursday, November 16, 2017 at 1:28:23 PM UTC+1, Gary Malouf wrote:
>>
>>> Hi Johannes,
>>>
>>> Yes; we are seeing 2-3 requests/second (only in production) with the
>>> latency spikes.  We found no correlation between GC times and these
>>> request latencies, nor between the size or type of the requests.
>>>
>>> We had to pause the migration effort for two weeks because of the time
>>> it was taking, but just jumped back on it the other day.
>>>
>>> Our current strategy is to implement this with the low-level API to see
>>> if we get the same results.
>>>
>>> Gary
>>>
>>> On Nov 16, 2017 6:57 AM,  wrote:
>>>
>>> Hi Gary,
>>>
>>> Did you find out what's going on by now? If I understand correctly, you
>>> get latency spikes as soon as you use the `entity(as[String])` directive?
>>> Could you narrow down whether there's anything special about those requests?
>>> I assume you are monitoring your GC times?
>>>
>>> Johannes
>>>
>>>
>>> On Wednesday, November 1, 2017 at 8:56:50 PM UTC+1, Gary Malouf wrote:
>>>
 So the only way I was able to successfully identify the suspicious code
 was to route a percentage of my production traffic to a stubbed route that
 I incrementally added pieces of our implementation back into.  What I found
 was that we started getting spikes when the entity(as[CaseClassFromJson])
 stub was added back in.  To figure out whether it was the JSON parsing or
 the 'POST' entity consumption itself, I replaced that case class with a
 plain String - it turns out we experience the latency spikes with that as
 well (on low traffic, as noted earlier in this thread).

 I by no means have a deep understanding of streams, but it makes me
 wonder if the way our code consumes the entity is incorrect.

 On Monday, October 30, 2017 at 4:27:13 PM UTC-4, Gary Malouf wrote:

> Hi Roland - thank you for the tip.  We shrunk the thread pool size
> down to 1, but were disheartened to still see the latency spikes.  Using
> Kamon's tracing library (which we validated with various tests to ensure
> its own numbers are most likely correct), we could not find anything in
> our code within the route that was causing the latency (it was all
> attributed to the route itself, but to no code segment within it).
>
> As mentioned earlier, running loads of 100-1000 requests/second
> completely hides the issue (save for the max latency), as everything
> through the 99th percentile is under a few milliseconds.
>
> On Tuesday, October 24, 2017 at 2:23:07 AM UTC-4, rkuhn wrote:
>
>> You could try to decrease your thread pool size to 1 to exclude
>> wakeup latencies when things (like CPU cores) have gone to sleep.
>>
>> Regards, Roland
>>
>> Sent from my iPhone
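
A sketch of one way to run that experiment, pinning the default dispatcher to
a single thread via an inline config override; the values are illustrative
only:

    // Sketch: start the ActorSystem on a single-threaded default dispatcher
    // to rule out thread wake-up latency.
    import akka.actor.ActorSystem
    import com.typesafe.config.ConfigFactory

    val singleThreaded = ConfigFactory.parseString(
      """
        |akka.actor.default-dispatcher {
        |  executor = "thread-pool-executor"
        |  thread-pool-executor.fixed-pool-size = 1
        |}
      """.stripMargin).withFallback(ConfigFactory.load())

    val system = ActorSystem("single-thread-test", singleThreaded)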
>>
>> On 23. Oct 2017, at 22:49, Gary Malouf  wrote:
>>
>> Yes, it gets parsed using entity(as[]) with spray-json support.
>> Under a load test of, say, 1000 requests/second these latencies are not
>> visible in the percentiles - they are easy to see because this web server
>> is currently getting 10-20 requests/second.  Trying to brainstorm whether
>> a dispatcher needs to be tuned or something of that sort, but I have yet
>> to see evidence supporting that.
>>
>> path("foos") {
>> traceName("FooSelection") {
>>
>> entity(as[ExternalPageRequest]) { pr =>
>> val spr = toSelectionPageRequest(pr)
>> shouldTracePageId(spr.pageId).fold(
>> Tracer.currentContext.withNewSegment(s"Page-${pr.pageId}", "PageTrace
>> ", "kamon") {
>> processPageRequestAndComplete(pr, spr)
>> },
>> processPageRequestAndComplete(pr, spr)
>> )
>> }
>> }
>>
>>
>> }
>>
>> On Mon, Oct 23, 2017 at 4:42 PM, Viktor Klang 
>> wrote:
>>
>>> And you consume the entityBytes I presume?
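
In case the answer is "not always": a sketch of explicitly draining the
entity on routes that do not read the body, so an unconsumed stream cannot
back-pressure the connection; the route name is a placeholder:

    // Sketch: drain the request body even when it is not needed.
    import akka.http.scaladsl.server.Directives._
    import akka.stream.scaladsl.Sink

    val drainingRoute =
      path("health") {
        extractMaterializer { implicit mat =>
          extractRequest { request =>
            request.entity.dataBytes.runWith(Sink.ignore) // consume and ignore the bytes
            complete("ok")
          }
        }
      }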
>>>
>>> On Mon, Oct 23, 2017 at 10:35 PM, Gary Malouf 
>>> wrote:
>>>
 It is from when I start the Kamon trace (just inside of my