Re: Debug Samza consumer lag issue

2016-08-24 Thread David Yu
Makes sense. Thanks for the help, Jake! On Wed, Aug 24, 2016 at 5:11 PM Jacob Maes wrote: > We don't have any hard guidelines around that metric just because there are > no hard rules that work for every job. For example, some jobs are very > bursty and need to keep up with

Re: Debug Samza consumer lag issue

2016-08-24 Thread Jacob Maes
We don't have any hard guidelines around that metric just because there are no hard rules that work for every job. For example, some jobs are very bursty and need to keep up with huge traffic ramp-ups even though they're underutilized the rest of the time. That said, yes, I have used that metric

Re: Debug Samza consumer lag issue

2016-08-24 Thread David Yu
Interesting. To me, "event-loop-utilization" looks like a good indicator that shows us how busy the containers are. Is it safe to use this metric as a reference when we need to scale out/in our job? For example, if I'm seeing around 0.3 utilization most of the time, maybe I can decrease the # of

Re: Debug Samza consumer lag issue

2016-08-24 Thread Jacob Maes
> Based on what you have described, the following should be true in 0.10.1:
> event-loop-ns = choose-ns + process-ns + window-ns (if necessary) + commit-ns (if necessary)

Yes, plus any time (e.g. due to an unlucky GC at just the right moment) that happens outside those timers. And no "if
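The relationship discussed above can be checked against reported metrics with a small helper. This is only a sketch: the function name and the sample numbers are hypothetical, and it assumes all five timers are sampled over the same interval and reported in nanoseconds. The remainder is the time spent outside the four timers, such as the GC pause mentioned above.

```python
def unaccounted_ns(event_loop_ns, choose_ns, process_ns, window_ns, commit_ns):
    """Return the part of event-loop-ns not covered by the four timers.

    Hypothetical helper, not part of Samza. All arguments are in
    nanoseconds and are assumed to cover the same measurement interval.
    """
    timed_ns = choose_ns + process_ns + window_ns + commit_ns
    return event_loop_ns - timed_ns


# Made-up sample values, for illustration only.
gap = unaccounted_ns(
    event_loop_ns=12_000_000,
    choose_ns=10_000_000,
    process_ns=1_500_000,
    window_ns=0,
    commit_ns=200_000,
)
print(gap)
```

A gap that stays large across samples would suggest that time is going somewhere the timers do not measure.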

Re: Debug Samza consumer lag issue

2016-08-24 Thread David Yu
Great. It all makes sense now. With the SSD fix, we also upgraded to 0.10.1. So we should see pretty consistent process-ns (which we do). Based on what you have described, the following should be true in 0.10.1: event-loop-ns = choose-ns + process-ns + window-ns (if necessary) + commit-ns (if

Re: Debug Samza consumer lag issue

2016-08-24 Thread Jacob Maes
A couple of other notes. Prior to Samza 0.10.1, choose-ns was part of process-ns. So when choose-ns and process-ns are both high (around 10,000,000 ns == 10 ms, which is the default poll timeout), that usually means the task is caught up. In Samza 0.10.1 the same is true if ONLY choose-ns is high.
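As a rough illustration of the rule of thumb above, the check could be written as follows. This is a sketch under stated assumptions: the function name, the `pre_0_10_1` flag, and the 20% tolerance for "around 10 ms" are hypothetical, and "if ONLY choose-ns is high" is read here as "choose-ns alone is enough". The 10 ms default poll timeout is the figure quoted in the message.

```python
POLL_TIMEOUT_NS = 10_000_000  # 10 ms default poll timeout, as quoted above


def looks_caught_up(choose_ns, process_ns, pre_0_10_1, tolerance=0.2):
    """Heuristic: does a task look caught up with its input?

    Hypothetical helper, not part of Samza. `tolerance` is an assumed
    margin for what counts as "around" the poll timeout.
    """
    near_timeout = POLL_TIMEOUT_NS * (1.0 - tolerance)
    choose_high = choose_ns >= near_timeout
    if pre_0_10_1:
        # Before 0.10.1, choose time is folded into process-ns,
        # so both timers sit near the poll timeout when idle.
        return choose_high and process_ns >= near_timeout
    # From 0.10.1 on, a high choose-ns by itself is the signal.
    return choose_high


print(looks_caught_up(9_500_000, 200_000, pre_0_10_1=False))
```

The message says this "usually" holds, so the result is a hint to compare against consumer lag, not a definitive test.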

Re: Debug Samza consumer lag issue

2016-08-24 Thread Jacob Maes
Hey David, Answering the most recent question first, since it's also the easiest. :-)

> Is choose-ns the total number of ms used to choose a message from the input stream? What are some gating factors (e.g. serialization?) for this metric?

It's the amount of time the event loop spent getting

Re: Debug Samza consumer lag issue

2016-08-24 Thread David Yu
More updates: 1. process-envelopes rate finally stabilized and converged. Consumer lag is down to zero. 2. avg choose-ns across containers dropped over time, which I assume is a good thing. My question: Is

Re: Debug Samza consumer lag issue

2016-08-23 Thread David Yu
Some metric updates: 1. We started seeing some containers with a higher choose-ns. Not sure what would be the cause of this. 2. We are seeing very different process-envelopes values across containers

Re: Debug Samza consumer lag issue

2016-08-23 Thread David Yu
Hi, Jake, Thanks for your suggestions. Some of my answers inline: 1. On Tue, Aug 23, 2016 at 11:53 AM Jacob Maes wrote: > Hey David, > > A few initial thoughts/questions: > > >1. Is this job using RocksDB to store the aggregations? If so, is it >running on a

Debug Samza consumer lag issue

2016-08-23 Thread David Yu
Dear Samza guys, We are here for some debugging suggestions on our Samza job (0.10.0), which lags behind on consumption after running for a couple of hours, regardless of the number of containers allocated (currently 5). Briefly, the job aggregates events into sessions (in Avro) during process()