Hey Nikos,

Yep, what we're describing here is precisely why there are TONS of DevOps jobs available. These systems are, at times, highly difficult to troubleshoot. But...unless there is a DEFCON1 deadline, it's normally quite fun to figure all of this out.
Regarding Storm's heavy use of Zookeeper, that's a very important point and precisely why we have a separate Zk cluster solely for Storm. We learned that having both Kafka and Storm use the same ZooKeeper cluster is something to avoid if possible. I know that Bobby Evans and the Yahoo team have done some great work on a separate heartbeat server that bypasses Zookeeper (mentioned in this presentation http://bit.ly/1GKmHi6).

--John

On Sat, Apr 2, 2016 at 10:02 AM, Nikos R. Katsipoulakis <[email protected]> wrote:

> Hello again,
>
> Well, it is not hand-waving... It is just hard to trace these issues in a real system.
>
> Definitely, varying behavior on Kafka makes all metrics unreliable, and it will create unstable behavior in your pipeline. Remember, Kafka is a distributed system itself, relying on ZooKeeper, which in turn is heavily abused by Storm. Therefore, there is an "entanglement" between Kafka and Storm in your setting. So, yes, your explanation sounds reasonable.
>
> Overall, I think your observations captured that period of heavy Kafka usage, and therefore cannot be considered the norm (they are rather an anomaly). Interesting problem though! I have been battling latency and throughput myself with Storm, and I find it a fascinating subject when one is dealing with real-time data analysis.
>
> Cheers,
> Nikos
>
> On Sat, Apr 2, 2016 at 9:35 AM, John Yost <[email protected]> wrote:
>
>> Hey Nikos,
>>
>> I believe I have it figured out. I spoke with another DevOps-er and she said we had a large backlog of data that's been hitting Kafka for the last couple of days, so our Kafka cluster has been under heavy load. Consequently, my hypothesis that, despite the fact that my topo is reading off of a static Kafka partition, there is significant variability in the KafkaSpout read rates is likely correct; this could explain the lower throughput combined with lower Complete Latency. In addition, since Kafka is under heavy load, there is likely corresponding variability in the Kafka writes; this could explain the tuple failures coupled with the lower throughput and Complete Latency.
>>
>> There's definitely a bit of hand-waving here, but the picture as I've written it out seems reasonable to me. What do you think?
>>
>> Thanks
>>
>> --John
>>
>> On Fri, Apr 1, 2016 at 1:43 PM, Nikos R. Katsipoulakis <[email protected]> wrote:
>>
>>> Glad that we are on the same page.
>>>
>>> The observation you present in your last paragraph is quite hard to answer. Before I present my explanation, I assume you are using more than one machine (i.e. using Netty for sending/receiving tuples). If that is the case, then I am pretty sure that if you check your workers' logs you will see a bunch of Netty events for failed messages. In the other scenario, where you are running Storm on a single node and using the LMAX Disruptor, you would need a wait strategy other than BlockingWaitStrategy in order to see failed messages. Please clarify your setup a bit further.
>>>
>>> Carrying on, if tuple failures happen in the Netty layer, then one explanation might be that something causes time-outs in your system (either context-switching or something else) and tuples are considered failed. Concretely, if you have a low input rate, then your Netty buffer needs more time to fill up a batch before sending it to the next node. Therefore, if your timeout is somewhere close to the time the oldest tuple spends in your Netty buffer, you might end up flagging that tuple as failed and having to re-send it.
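For reference, a minimal sketch of the topology-level knobs that interact with the timeout behavior described above, assuming the pre-1.0 backtype.storm.Config API (repackaged as org.apache.storm.Config in Storm 1.0); the values here are illustrative, not recommendations:

import backtype.storm.Config;

public class TimeoutTuningSketch {
    public static Config baseConfig() {
        Config conf = new Config();
        // topology.message.timeout.secs: how long a tuple may stay un-acked
        // before Storm replays it. If this sits close to the time tuples wait
        // in a half-empty Netty batch at low input rates, spurious failures
        // can show up.
        conf.setMessageTimeoutSecs(120);
        // topology.max.spout.pending: caps tuples in flight per spout task so
        // the topology never queues more than it can ack within the timeout.
        conf.setMaxSpoutPending(2000);
        // topology.acker.executors: enough ackers to keep ack latency low.
        conf.setNumAckers(2);
        return conf;
    }
}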
>>> The previous is one possible scenario. Nevertheless, what you are experiencing might originate in any of Storm's layers and is hard to pinpoint exactly.
>>>
>>> Cheers,
>>> Nikos
>>>
>>> On Fri, Apr 1, 2016 at 1:02 PM, John Yost <[email protected]> wrote:
>>>
>>>> Hey Nikos,
>>>>
>>>> Yep, totally agree with what you've written here. As I wrote in my response, the key point is whether the input rate differs. If the input rate is lower, I totally agree that it makes sense for both the throughput and the Complete Latency to go down. Similarly, if the rate goes up, the Complete Latency goes up since there's more data for the topo to process.
>>>>
>>>> For my purposes, I need to confirm whether the rate of data coming in--in other words, the rate at which KafkaSpout is reading off of a static Kafka topic--varies between experiments. Up until now I've discounted this possibility, but, given the great points you've made here, I'm going to revisit it.
>>>>
>>>> Having written all of this...another important point: when I see the throughput and Complete Latency decrease in tandem, I also see tuple failures go from zero at or close to the best throughput to a 1-5% failure rate when throughput and Complete Latency decrease. This seems to contradict what you and I are discussing here, but I may be missing something. What do you think?
>>>>
>>>> --John
>>>>
>>>> On Fri, Apr 1, 2016 at 12:34 PM, Nikos R. Katsipoulakis <[email protected]> wrote:
>>>>
>>>>> Hello again John,
>>>>>
>>>>> No need to apologize. All experiments in distributed environments have so many details that it is only normal to forget some of them in an initial explanation.
>>>>>
>>>>> Going back to my point on seeing lower throughput with lower latency, I will give you an example:
>>>>>
>>>>> Assume you own a system that has the resources to handle up to 100 events/sec and guarantees a mean latency of 5 msec. If you run an experiment in which you send in 100 events/sec, your system runs at 100% capacity and you monitor 5 msec end-to-end latency. Your throughput is expected to be somewhere close to 100 events/sec (but lower if you factor in latency). Now, if you run another experiment in which you send in 50 events/sec, your system runs at 50% capacity and you monitor an average end-to-end latency somewhere around 2.5 msec. In the second experiment, you would expect to see lower throughput than in the first, somewhere around 50 events/sec.
>>>>>
>>>>> Of course, in the example above I assumed that there is a 1:1 mapping between each input data point and each output. If that is not the case, then you have to give more details.
>>>>>
>>>>> Thanks,
>>>>> Nikos
>>>>>
>>>>> On Fri, Apr 1, 2016 at 12:20 PM, John Yost <[email protected]> wrote:
>>>>>
>>>>>> Hey Nikos,
>>>>>>
>>>>>> Thanks for responding so quickly, and I apologize for leaving out a crucially important detail--the Kafka topic. My topo is reading from a static topic. I definitely agree that reading from a live topic could--and likely would--lead to variable throughput rates, both in terms of raw input rates as well as variability in the content. Again, great question and points; I should have specified in my original post that my topo is reading from a static Kafka topic.
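As a rough sketch (not John's actual spout wiring), this is how a KafkaSpout of that era can be pinned to a static topic and made to replay from the earliest offset on every run, using the old storm-kafka SpoutConfig API; the hosts, topic, and id strings are placeholders, and forceFromStart was renamed ignoreZkOffsets in later storm-kafka releases:

import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;

public class StaticTopicSpoutSketch {
    public static KafkaSpout buildSpout() {
        // Placeholder ZooKeeper connect string for the Kafka cluster.
        BrokerHosts hosts = new ZkHosts("kafka-zk1:2181,kafka-zk2:2181");
        SpoutConfig cfg = new SpoutConfig(hosts, "static-benchmark-topic",
                                          "/storm-kafka", "latency-test-spout");
        // Always start from the beginning of the partitions and ignore any
        // offsets stored in ZooKeeper, so every experiment reads the same data.
        cfg.startOffsetTime = kafka.api.OffsetRequest.EarliestTime();
        cfg.forceFromStart = true; // ignoreZkOffsets = true in newer releases
        return new KafkaSpout(cfg);
    }
}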
>>>>>> Regarding your third point, my thinking is that throughput would go up if Complete Latency went down, since it's my understanding that Complete Latency measures the average amount of time each tuple spends in the topology. The key here is whether the input rate stays the same. If Complete Latency decreases, more tuples can be processed by the topology in a given amount of time. But I see what you're saying: the average time spent on each tuple would be higher if the input rate goes up, because there's more data per second, more context switching amongst the executors, etc. Please confirm if I am thinking about this the wrong way, because this seems to be a pretty fundamental fact about Storm that I need to have right.
>>>>>>
>>>>>> Great point regarding waiting for the topology to complete its warm-up. I let my topo run for 20 minutes before measuring anything.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> --John
>>>>>>
>>>>>> On Fri, Apr 1, 2016 at 9:54 AM, Nikos R. Katsipoulakis <[email protected]> wrote:
>>>>>>
>>>>>>> Hello John,
>>>>>>>
>>>>>>> I have to say that a system's telemetry is not a mystery easily understood. So, let us try to deduce what in your use-case might be causing inconsistent performance metrics.
>>>>>>>
>>>>>>> First, I would like to ask whether your KafkaSpouts produce tuples at the same rate. In other words, do you produce or read data in a deterministic (replayable) way, or do you attach your KafkaSpout to a non-controllable source of data (like a Twitter feed, news feed, etc.)? The reason I am asking is that figuring out what happens at the source of your data (in terms of input rate) is really important. If your use-case involves a varying input rate for your sources, I would suggest picking a particular snapshot of that source and replaying your experiments in order to check whether the variance in latency/throughput still exists.
>>>>>>>
>>>>>>> The second point I would like to make is that sometimes throughput (or ack rate, as you correctly put it) might be related to the data you are pushing. For instance, a computation-heavy task might take more time for one value distribution than for another. Therefore, please make sure that the data you send into the system always causes the same amount of computation.
>>>>>>>
>>>>>>> And third, seeing throughput and latency drop at the same time immediately points to a dropped input rate. Think about it. If I send in tuples at a lower input rate, I expect throughput to drop (since fewer tuples are coming in), and at the same time the heavy computation has to work with less data (thus end-to-end latency also drops). Does that make sense to you? Can you verify that you had consistent input rates across the different runs?
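To put numbers to that third point, Little's Law (tuples in flight = throughput x latency) ties the two metrics together; the figures below are the hypothetical 100 vs. 50 events/sec scenario from earlier in the thread, not measurements from John's topology:

public class LittlesLawSketch {
    // L = lambda * W: average number of tuples in flight inside the topology.
    static double tuplesInFlight(double ackedPerSec, double completeLatencySecs) {
        return ackedPerSec * completeLatencySecs;
    }

    public static void main(String[] args) {
        // Full load: 100 acks/sec at 5 ms Complete Latency -> ~0.5 tuples in flight.
        System.out.println(tuplesInFlight(100, 0.005));
        // Half load: 50 acks/sec at 2.5 ms -> ~0.125 tuples in flight. Both
        // throughput and latency fall because less work is queued anywhere in
        // the pipeline -- the signature of a lower input rate.
        System.out.println(tuplesInFlight(50, 0.0025));
    }
}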
>>>>>>> Finally, I would suggest that you let Storm warm up and drop your initial metrics. In my experience with Storm, latency and throughput at the beginning of a run (until all buffers fill up) are highly variable, and therefore not reliable data points to include in your analysis. You can verify my claim by plotting your data over time.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nikos
>>>>>>>
>>>>>>> On Fri, Apr 1, 2016 at 9:16 AM, John Yost <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>>
>>>>>>>> I am a little puzzled by what I am seeing in some testing with a topology of mine that reads from a KafkaSpout, does some CPU-intensive processing, and then writes out to Kafka via the standard KafkaBolt.
>>>>>>>>
>>>>>>>> I am testing in a multi-tenant environment, so test results can vary by 10-20% on average. However, results have been much more variable the last couple of days.
>>>>>>>>
>>>>>>>> The big thing I am noticing: whereas the throughput--as measured in tuples acked/minute--today is half of what it was yesterday for the same configuration, the Complete Latency (total time a tuple is in the topology from the time it hits the KafkaSpout to the time it is acked in the KafkaBolt) today is a third of what it was yesterday.
>>>>>>>>
>>>>>>>> Any ideas as to how the throughput could go down dramatically at the same time the Complete Latency is improving?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> --John
>>>>>>>
>>>>>>> --
>>>>>>> Nikos R. Katsipoulakis,
>>>>>>> Department of Computer Science
>>>>>>> University of Pittsburgh
>>>>>
>>>>> --
>>>>> Nikos R. Katsipoulakis,
>>>>> Department of Computer Science
>>>>> University of Pittsburgh
>>>
>>> --
>>> Nikos R. Katsipoulakis,
>>> Department of Computer Science
>>> University of Pittsburgh
>
> --
> Nikos R. Katsipoulakis,
> Department of Computer Science
> University of Pittsburgh
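To make the warm-up suggestion above concrete, here is a hypothetical helper (the Sample class, the 20-minute cutoff, and the data source are all placeholders, not anything from the thread) that discards measurements taken before the topology has warmed up and averages the rest:

import java.util.List;
import java.util.concurrent.TimeUnit;

public class WarmupFilterSketch {
    // Placeholder metric sample: capture time plus the Complete Latency reading.
    static class Sample {
        final long epochMillis;
        final double completeLatencyMs;
        Sample(long epochMillis, double completeLatencyMs) {
            this.epochMillis = epochMillis;
            this.completeLatencyMs = completeLatencyMs;
        }
    }

    // Average Complete Latency over the steady state only, ignoring the
    // warm-up window while buffers are still filling.
    static double steadyStateAvg(List<Sample> samples, long topologyStartMillis) {
        long cutoff = topologyStartMillis + TimeUnit.MINUTES.toMillis(20);
        return samples.stream()
                .filter(s -> s.epochMillis >= cutoff)
                .mapToDouble(s -> s.completeLatencyMs)
                .average()
                .orElse(Double.NaN);
    }
}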
