Hey Nikos,

Thanks for responding so quickly, and I apologize for leaving out a crucially important detail: the Kafka topic. My topo is reading from a static topic. I definitely agree that reading from a live topic could, and likely would, lead to variable throughput, both in the raw input rate and in the content itself. Great questions and points; I should have specified the static topic in my original post.
Regarding your third point, my thinking was that throughput would go up if Complete Latency went down, since it's my understanding that Complete Latency measures the average amount of time each tuple spends in the topology. The key "if" here is that the input rate stays the same: if Complete Latency decreases, more tuples can be processed by the topology in a given amount of time. But I see what you're saying: the average time spent on each tuple would go up if the input rate goes up, because there's more data per second, more context switching amongst the executors, etc. (and conversely, it would drop with a lower input rate). Please confirm if I am thinking about this the wrong way, because this seems to be a pretty fundamental fact about Storm that I need to get right.

Great point regarding waiting for the topology to complete its warm-up. I let my topo run for 20 minutes before measuring anything.
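One more thought on your third point, just to sanity-check my mental model. If a topology's spout were throttled by topology.max.spout.pending (I'm not saying mine is; treat this as a hypothetical), then by Little's Law the tuples in flight ~= throughput x Complete Latency, which bounds the ack rate. Here's a quick back-of-the-envelope sketch in Python; expected_throughput is just a throwaway helper and all the numbers are made up for illustration:

    # Little's Law: in_flight ~= throughput * complete_latency.
    # If the spout is the bottleneck (in-flight tuples capped by
    # topology.max.spout.pending), then at steady state:
    #   throughput ~= max_spout_pending / complete_latency
    # All numbers below are made up purely for illustration.

    def expected_throughput(max_spout_pending, complete_latency_sec):
        """Approximate max ack rate (tuples/sec) with a capped in-flight count."""
        return max_spout_pending / complete_latency_sec

    print(expected_throughput(1000, 0.30))  # "yesterday": ~3,333 tuples/sec
    print(expected_throughput(1000, 0.10))  # "today", 1/3 latency: 10,000 tuples/sec

If that framing is right, a 3x latency improvement at a constant input rate should push throughput up, not cut it in half, which lines up with your suspicion that the input rate itself dropped.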
Thanks

--John

On Fri, Apr 1, 2016 at 9:54 AM, Nikos R. Katsipoulakis <[email protected]> wrote:

> Hello John,
>
> I have to say that a system's telemetry is not a mystery easily understood. That said, let us try to deduce what might be causing the inconsistent performance metrics in your use case.
>
> First, I would like to ask whether your KafkaSpouts produce tuples at the same rate. In other words, do you produce or read data in a deterministic (replayable) way, or do you attach your KafkaSpout to a non-controllable source of data (like a Twitter feed, news feed, etc.)? The reason I am asking is that figuring out what happens at the source of your data (in terms of input rate) is really important. If your use case involves a varying input rate at the sources, I would suggest picking a particular snapshot of that source and replaying your experiments, to check whether the variance in latency/throughput still exists.
>
> The second point I would like to make is that throughput (or ack rate, as you correctly put it) can sometimes be related to the data you are pushing. For instance, a computation-heavy task might take more time for one value distribution than for another. Therefore, please make sure that the data you send into the system always causes the same amount of computation.
>
> And third, noticing throughput and latency dropping at the same time immediately points to a dropped input rate. Think about it: if I send in tuples at a lower input rate, I expect throughput to drop (since fewer tuples are coming in), and at the same time the heavy computation has less data to work on (thus end-to-end latency also drops). Does that make sense to you? Can you verify that you had consistent input rates across the different runs?
>
> Finally, I would suggest that you let Storm warm up and drop your initial metrics. In my experience with Storm, latency and throughput at the beginning of a run (until all buffers fill up) are highly variable, and therefore not reliable data points to include in your analysis. You can verify my claim by plotting your metrics over time.
>
> Thanks,
> Nikos
>
> On Fri, Apr 1, 2016 at 9:16 AM, John Yost <[email protected]> wrote:
>
>> Hi Everyone,
>>
>> I am a little puzzled by what I am seeing in some testing with a topology I have, where the topo is reading from a KafkaSpout, doing some CPU-intensive processing, and then writing out to Kafka via the standard KafkaBolt.
>>
>> I am doing my testing in a multi-tenant environment, so test results can vary by 10-20% on average. However, results have been much more variable the last couple of days.
>>
>> The big thing I am noticing: whereas the throughput, as measured in tuples acked per minute, is half today of what it was yesterday for the same configuration, the Complete Latency (the total time a tuple is in the topology, from the time it hits the KafkaSpout to the time it is acked in the KafkaBolt) today is a third of what it was yesterday.
>>
>> Any ideas as to how the throughput could go down dramatically at the same time the Complete Latency is improving?
>>
>> Thanks
>>
>> --John
>
>
> --
> Nikos R. Katsipoulakis,
> Department of Computer Science
> University of Pittsburgh
