Hey Nikos,

Yep, what we're describing here is precisely why there are TONS of DevOps jobs available. These systems are, at times, highly difficult to troubleshoot. But...unless there is a DEFCON1 deadline, it's normally quite fun to figure all of this out.
Regarding Storm's heavy use of Zookeeper, that's a very important point and precisely why we have a separate Zk cluster solely for Storm. We learned that having both Kafka and Storm use the same ZooKeeper cluster is something to avoid if possible. I know that Bobby Evans and the Yahoo team have done some great work on a separate heartbeat server that bypasses Zookeeper (mentioned in this presentation http://bit.ly/1GKmHi6).

--John

On Sat, Apr 2, 2016 at 10:02 AM, Nikos R. Katsipoulakis <[email protected]> wrote:

> Hello again,
>
> Well, it is not hand-waving... It is just hard to trace these issues in a real system.
>
> Definitely, varying behavior on Kafka makes all metrics unreliable, and it will create unstable behavior in your pipeline. Remember, Kafka is a distributed system itself, relying on ZooKeeper, which in turn is heavily abused by Storm. Therefore, there is an "entanglement" between Kafka and Storm in your setting. So, yes, your explanation sounds reasonable.
>
> Overall, I think your observations captured that period of heavy Kafka usage, and therefore cannot be considered the norm (they are rather an anomaly). Interesting problem though! I have been battling latency and throughput myself with Storm, and I find it a fascinating subject when one is dealing with real-time data analysis.
>
> Cheers,
> Nikos
>
> On Sat, Apr 2, 2016 at 9:35 AM, John Yost <[email protected]> wrote:
>
>> Hey Nikos,
>>
>> I believe I have it figured out. I spoke with another DevOps-er and she said we had a large backlog of data that's been hitting Kafka for the last couple of days, so our Kafka cluster has been under heavy load. Consequently, my hypothesis that, despite the fact that my topo is reading off of a static Kafka partition, there is significant variability in the KafkaSpout read rates is likely correct; this could explain the lower throughput combined with lower Complete Latency. In addition, since Kafka is under heavy load, there is likely corresponding variability in the Kafka writes; this could explain the tuple failures coupled with the lower throughput and Complete Latency.
>>
>> There's definitely a bit of hand-waving here, but the picture as I've written it out seems reasonable to me. What do you think?
>>
>> Thanks
>>
>> --John
>>
>> On Fri, Apr 1, 2016 at 1:43 PM, Nikos R. Katsipoulakis <[email protected]> wrote:
>>
>>> Glad that we are on the same page.
>>>
>>> The observation you present in your last paragraph is quite hard to answer. Before I present my explanation, I assume you are using more than one machine (i.e. using Netty for sending/receiving tuples). If that is the case, then I am pretty sure that if you check your workers' logs you will see a bunch of Netty events for failed messages. In the other scenario, where you are running Storm on a single node and using the LMAX Disruptor, you would need a wait strategy other than BlockingWaitStrategy in order to see failed messages. Please clarify your setup a bit further.
>>>
>>> Carrying on, if tuple failures happen in the Netty layer, then one explanation might be that something causes time-outs in your system (either context-switching or something else) and tuples are considered failed. Concretely, if you have a low input rate, then your Netty buffer needs more time to fill up a batch before sending it to the next node. Therefore, if your timeout is somewhere close to the time the oldest tuple spends in your Netty buffer, you might end up flagging that tuple as failed and having to re-send it.
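For reference, a minimal sketch of the topology-level knobs that interact with the timeout behavior described above, assuming the pre-1.0 backtype.storm.Config API (repackaged as org.apache.storm.Config in Storm 1.0); the values here are illustrative, not recommendations:

import backtype.storm.Config;

public class TimeoutTuningSketch {
    public static Config baseConfig() {
        Config conf = new Config();
        // topology.message.timeout.secs: how long a tuple may stay un-acked
        // before Storm replays it. If this sits close to the time tuples wait
        // in a half-empty Netty batch at low input rates, spurious failures
        // can show up.
        conf.setMessageTimeoutSecs(120);
        // topology.max.spout.pending: caps tuples in flight per spout task so
        // the topology never queues more than it can ack within the timeout.
        conf.setMaxSpoutPending(2000);
        // topology.acker.executors: enough ackers to keep ack latency low.
        conf.setNumAckers(2);
        return conf;
    }
}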
>>> The previous is one possible scenario. Nevertheless, what you are experiencing might originate in any of Storm's layers and is hard to pinpoint exactly.
>>>
>>> Cheers,
>>> Nikos
>>>
>>> On Fri, Apr 1, 2016 at 1:02 PM, John Yost <[email protected]> wrote:
>>>
>>>> Hey Nikos,
>>>>
>>>> Yep, totally agree with what you've written here. As I wrote in my response, the key point is whether the input rate differs. If the input rate is lower, I totally agree that it makes sense for both the throughput and the Complete Latency to go down. Similarly, if the rate goes up, the Complete Latency goes up since there's more data for the topo to process.
>>>>
>>>> For my purposes, I need to confirm whether the rate of data coming in--in other words, the rate at which KafkaSpout is reading off of a static Kafka topic--varies between experiments. Up until now I've discounted this possibility, but, given the great points you've made here, I'm going to revisit it.
>>>>
>>>> Having written all of this...another important point: when I see the throughput and Complete Latency decrease in tandem, I also see tuple failures go from zero at or close to the best throughput to a 1-5% failure rate when throughput and Complete Latency decrease. This seems to contradict what you and I are discussing here, but I may be missing something. What do you think?
>>>>
>>>> --John
>>>>
>>>> On Fri, Apr 1, 2016 at 12:34 PM, Nikos R. Katsipoulakis <[email protected]> wrote:
>>>>
>>>>> Hello again John,
>>>>>
>>>>> No need to apologize. All experiments in distributed environments have so many details that it is only normal to forget some of them in an initial explanation.
>>>>>
>>>>> Going back to my point on seeing lower throughput with lower latency, I will give you an example:
>>>>>
>>>>> Assume you own a system that has the resources to handle up to 100 events/sec and guarantees a mean latency of 5 msec. If you run an experiment in which you send in 100 events/sec, your system runs at 100% capacity and you monitor 5 msec end-to-end latency. Your throughput is expected to be somewhere close to 100 events/sec (but lower if you factor in latency). Now, if you run another experiment in which you send in 50 events/sec, your system runs at 50% capacity and you monitor an average end-to-end latency somewhere around 2.5 msec. In the second experiment, you would expect to see lower throughput than in the first, somewhere around 50 events/sec.
>>>>>
>>>>> Of course, in the example above I assumed that there is a 1:1 mapping between each input data point and each output. If that is not the case, then you have to give more details.
>>>>>
>>>>> Thanks,
>>>>> Nikos
>>>>>
>>>>> On Fri, Apr 1, 2016 at 12:20 PM, John Yost <[email protected]> wrote:
>>>>>
>>>>>> Hey Nikos,
>>>>>>
>>>>>> Thanks for responding so quickly, and I apologize for leaving out a crucially important detail--the Kafka topic. My topo is reading from a static topic. I definitely agree that reading from a live topic could--and likely would--lead to variable throughput rates, both in terms of raw input rates as well as variability in the content. Again, great question and points; I should have specified in my original post that my topo is reading from a static Kafka topic.
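As a rough sketch (not John's actual spout wiring), this is how a KafkaSpout of that era can be pinned to a static topic and made to replay from the earliest offset on every run, using the old storm-kafka SpoutConfig API; the hosts, topic, and id strings are placeholders, and forceFromStart was renamed ignoreZkOffsets in later storm-kafka releases:

import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;

public class StaticTopicSpoutSketch {
    public static KafkaSpout buildSpout() {
        // Placeholder ZooKeeper connect string for the Kafka cluster.
        BrokerHosts hosts = new ZkHosts("kafka-zk1:2181,kafka-zk2:2181");
        SpoutConfig cfg = new SpoutConfig(hosts, "static-benchmark-topic",
                                          "/storm-kafka", "latency-test-spout");
        // Always start from the beginning of the partitions and ignore any
        // offsets stored in ZooKeeper, so every experiment reads the same data.
        cfg.startOffsetTime = kafka.api.OffsetRequest.EarliestTime();
        cfg.forceFromStart = true; // ignoreZkOffsets = true in newer releases
        return new KafkaSpout(cfg);
    }
}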
>>>>>> Regarding your third point, my thinking is that throughput would go up if Complete Latency went down, since it's my understanding that Complete Latency measures the average amount of time each tuple spends in the topology. The key here is whether the input rate stays the same. If Complete Latency decreases, more tuples can be processed by the topology in a given amount of time. But I see what you're saying: the average time spent on each tuple would be higher if the input rate goes up, because there's more data per second, more context switching amongst the executors, etc. Please confirm if I am thinking about this the wrong way, because this seems to be a pretty fundamental fact about Storm that I need to have right.
>>>>>>
>>>>>> Great point regarding waiting for the topology to complete its warm-up. I let my topo run for 20 minutes before measuring anything.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> --John
>>>>>>
>>>>>> On Fri, Apr 1, 2016 at 9:54 AM, Nikos R. Katsipoulakis <[email protected]> wrote:
>>>>>>
>>>>>>> Hello John,
>>>>>>>
>>>>>>> I have to say that a system's telemetry is not a mystery easily understood. So, let us try to deduce what in your use-case might be causing inconsistent performance metrics.
>>>>>>>
>>>>>>> First, I would like to ask whether your KafkaSpouts produce tuples at the same rate. In other words, do you produce or read data in a deterministic (replayable) way, or do you attach your KafkaSpout to a non-controllable source of data (like a Twitter feed, news feed, etc.)? The reason I am asking is that figuring out what happens at the source of your data (in terms of input rate) is really important. If your use-case involves a varying input rate for your sources, I would suggest picking a particular snapshot of that source and replaying your experiments in order to check whether the variance in latency/throughput still exists.
>>>>>>>
>>>>>>> The second point I would like to make is that sometimes throughput (or ack rate, as you correctly put it) might be related to the data you are pushing. For instance, a computation-heavy task might take more time for one value distribution than for another. Therefore, please make sure that the data you send into the system always causes the same amount of computation.
>>>>>>>
>>>>>>> And third, seeing throughput and latency drop at the same time immediately points to a dropped input rate. Think about it. If I send in tuples at a lower input rate, I expect throughput to drop (since fewer tuples are coming in), and at the same time the heavy computation has to work with less data (thus end-to-end latency also drops). Does that make sense to you? Can you verify that you had consistent input rates across the different runs?
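To put numbers to that third point, Little's Law (tuples in flight = throughput x latency) ties the two metrics together; the figures below are the hypothetical 100 vs. 50 events/sec scenario from earlier in the thread, not measurements from John's topology:

public class LittlesLawSketch {
    // L = lambda * W: average number of tuples in flight inside the topology.
    static double tuplesInFlight(double ackedPerSec, double completeLatencySecs) {
        return ackedPerSec * completeLatencySecs;
    }

    public static void main(String[] args) {
        // Full load: 100 acks/sec at 5 ms Complete Latency -> ~0.5 tuples in flight.
        System.out.println(tuplesInFlight(100, 0.005));
        // Half load: 50 acks/sec at 2.5 ms -> ~0.125 tuples in flight. Both
        // throughput and latency fall because less work is queued anywhere in
        // the pipeline -- the signature of a lower input rate.
        System.out.println(tuplesInFlight(50, 0.0025));
    }
}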
>>>>>>> Finally, I would suggest that you let Storm warm up and drop your initial metrics. In my experience with Storm, latency and throughput at the beginning of a run (until all buffers fill up) are highly variable, and therefore not reliable data points to include in your analysis. You can verify my claim by plotting your data over time.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nikos
>>>>>>>
>>>>>>> On Fri, Apr 1, 2016 at 9:16 AM, John Yost <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>>
>>>>>>>> I am a little puzzled by what I am seeing in some testing with a topology of mine that reads from a KafkaSpout, does some CPU-intensive processing, and then writes out to Kafka via the standard KafkaBolt.
>>>>>>>>
>>>>>>>> I am testing in a multi-tenant environment, so test results can vary by 10-20% on average. However, results have been much more variable the last couple of days.
>>>>>>>>
>>>>>>>> The big thing I am noticing: whereas the throughput--as measured in tuples acked/minute--today is half of what it was yesterday for the same configuration, the Complete Latency (total time a tuple is in the topology from the time it hits the KafkaSpout to the time it is acked in the KafkaBolt) today is a third of what it was yesterday.
>>>>>>>>
>>>>>>>> Any ideas as to how the throughput could go down dramatically at the same time the Complete Latency is improving?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> --John
>>>>>>>
>>>>>>> --
>>>>>>> Nikos R. Katsipoulakis,
>>>>>>> Department of Computer Science
>>>>>>> University of Pittsburgh
>>>>>
>>>>> --
>>>>> Nikos R. Katsipoulakis,
>>>>> Department of Computer Science
>>>>> University of Pittsburgh
>>>
>>> --
>>> Nikos R. Katsipoulakis,
>>> Department of Computer Science
>>> University of Pittsburgh
>
> --
> Nikos R. Katsipoulakis,
> Department of Computer Science
> University of Pittsburgh
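To make the warm-up suggestion above concrete, here is a hypothetical helper (the Sample class, the 20-minute cutoff, and the data source are all placeholders, not anything from the thread) that discards measurements taken before the topology has warmed up and averages the rest:

import java.util.List;
import java.util.concurrent.TimeUnit;

public class WarmupFilterSketch {
    // Placeholder metric sample: capture time plus the Complete Latency reading.
    static class Sample {
        final long epochMillis;
        final double completeLatencyMs;
        Sample(long epochMillis, double completeLatencyMs) {
            this.epochMillis = epochMillis;
            this.completeLatencyMs = completeLatencyMs;
        }
    }

    // Average Complete Latency over the steady state only, ignoring the
    // warm-up window while buffers are still filling.
    static double steadyStateAvg(List<Sample> samples, long topologyStartMillis) {
        long cutoff = topologyStartMillis + TimeUnit.MINUTES.toMillis(20);
        return samples.stream()
                .filter(s -> s.epochMillis >= cutoff)
                .mapToDouble(s -> s.completeLatencyMs)
                .average()
                .orElse(Double.NaN);
    }
}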
