So, I ran my own latency test on a 3-node cluster, and there is a significant difference between partitions at the 99th percentile and higher when measuring ack time with the producer configured for a single ack. The graph that I wish I could attach or post clearly shows that around 1/3 of the partitions diverge significantly from the other two thirds. So, at least in my case, one of my brokers is farther away than the others.
-Erik
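For anyone who wants to reproduce the per-partition comparison described above, here is a minimal sketch in plain Java. It is not the benchmark used for the numbers in this thread. The class name PartitionLatencies and its methods are made up for illustration, and it keeps raw samples in memory and sorts them, where a real test would more likely use something like HdrHistogram. In a real run, record() would presumably be called from the producer's send callback (Callback.onCompletion) with RecordMetadata.partition() and the elapsed System.nanoTime(); here main() just feeds it synthetic numbers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: collects ack latencies per partition and reports percentiles. */
public class PartitionLatencies {
    // partition id -> ack latencies (nanoseconds) observed for that partition
    private final Map<Integer, List<Long>> byPartition = new TreeMap<>();

    /** Record one ack latency. Synchronized because callbacks may run on another thread. */
    public synchronized void record(int partition, long latencyNanos) {
        byPartition.computeIfAbsent(partition, p -> new ArrayList<>()).add(latencyNanos);
    }

    /** Nearest-rank percentile (0 < pct <= 100) of the latencies seen for one partition. */
    public synchronized long percentile(int partition, double pct) {
        List<Long> values = byPartition.get(partition);
        if (values == null || values.isEmpty()) {
            throw new IllegalArgumentException("no samples for partition " + partition);
        }
        List<Long> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.size() / 100.0);
        return sorted.get(Math.max(rank, 1) - 1);
    }

    /** One line per partition with the median and the 99th percentile, for side-by-side comparison. */
    public synchronized String report() {
        StringBuilder sb = new StringBuilder();
        for (Integer partition : byPartition.keySet()) {
            sb.append("partition ").append(partition)
              .append(" p50=").append(percentile(partition, 50.0))
              .append(" p99=").append(percentile(partition, 99.0))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PartitionLatencies latencies = new PartitionLatencies();
        // Synthetic data only: partition 2 is made slower on purpose to mimic
        // a leader that is farther away from the producer.
        for (int i = 1; i <= 100; i++) {
            latencies.record(0, 1_000_000L + i);
            latencies.record(1, 1_000_000L + i);
            latencies.record(2, 5_000_000L + i);
        }
        System.out.print(latencies.report());
    }
}
```

If one partition's median or 99th percentile sits well above the others for a single producer, that points at the broker leading that partition rather than at the producer.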
On 9/4/15, 1:06 PM, "Yuheng Du" <[email protected]> wrote:
>No problem. Thanks for your advice. I think it would be fun to explore. I
>only know how to program in Java, though. Hope it will work.
>
>On Fri, Sep 4, 2015 at 2:03 PM, Helleren, Erik <[email protected]> wrote:
>
>> I think the suggestion is to have partitions/brokers >= 1, so 32 should
>> be enough.
>>
>> As for latency tests, there isn’t a lot of code to do a latency test.
>> If you just want to measure ack time, it's around 100 lines. I will try
>> to push out some good latency testing code to GitHub, but my company is
>> scared of open sourcing code… so it might be a while…
>> -Erik
>>
>> On 9/4/15, 12:55 PM, "Yuheng Du" <[email protected]> wrote:
>>
>> >Thanks for your reply, Erik. I am running some more tests according to
>> >your suggestions now and I will share my results here. Is it necessary
>> >to use a fixed number of partitions (32 partitions maybe) for my test?
>> >
>> >I am testing 2, 4, 8, 16 and 32 broker scenarios, all of them running
>> >on individual physical nodes. So I think using at least 32 partitions
>> >will make more sense? I have seen latencies increase as the number of
>> >partitions goes up in my experiments.
>> >
>> >To get the latency of each event recorded, are you suggesting that I
>> >write my own test program (in Java perhaps), or can I just modify the
>> >standard test program provided by Kafka (
>> >https://gist.github.com/jkreps/c7ddb4041ef62a900e6c )? I guess I need
>> >to rebuild the source if I modify the standard Java test program
>> >ProducerPerformance provided in Kafka, right? Right now this standard
>> >program only reports average latencies and percentile latencies, but
>> >no per-event latencies.
>> >
>> >Thanks.
>> >
>> >On Fri, Sep 4, 2015 at 1:42 PM, Helleren, Erik <[email protected]> wrote:
>> >
>> >> That is an excellent question! There are a bunch of ways to monitor
>> >> jitter and see when that is happening.
>> >> Here are a few:
>> >>
>> >> - You could slice the histogram every few seconds, save it out with a
>> >> timestamp, and then look at how the slices compare. This would be
>> >> mostly manual, or you can graph line charts of the percentiles over
>> >> time in Excel, where each percentile would be a series. If you are
>> >> using HdrHistogram, you should look at how to use the Recorder class
>> >> to do this, coupled with a ScheduledExecutorService.
>> >>
>> >> - You can just save the starting timestamp of the event and the
>> >> latency of each event. If you put it into a CSV, you can just load it
>> >> up into Excel and graph it as an XY chart. That way you can see every
>> >> point during the run of your program and you can see trends. You want
>> >> to be careful about this one, especially about writing to a file in
>> >> the callback that Kafka provides.
>> >>
>> >> Also, I have noticed that most of the very slow observations are at
>> >> startup. But don’t trust me, trust the data and share your findings.
>> >> Also, having a 99.9th percentile provides a pretty good standard for
>> >> typical poor-case performance. Average is borderline useless; the 50th
>> >> percentile is a better typical case because that’s the number that
>> >> says “half of events will be this slow or faster”, or for values that
>> >> are high like the 99.9th percentile, “0.1% of all events will be
>> >> slower than this”.
>> >> -Erik
>> >>
>> >> On 9/4/15, 12:05 PM, "Yuheng Du" <[email protected]> wrote:
>> >>
>> >> >Thank you, Erik! That's helpful!
>> >> >
>> >> >But I also see jitter in the maximum latencies when running the
>> >> >experiment.
>> >> >
>> >> >The average end-to-acknowledgement latency from producer to broker is
>> >> >around 5 ms when using 92 producers and 4 brokers, and the 99.9th
>> >> >percentile latency is 58 ms, but the maximum latency goes up to
>> >> >1359 ms. How can I locate the source of this jitter?
>> >> >
>> >> >Thanks.
>> >> >
>> >> >On Fri, Sep 4, 2015 at 10:54 AM, Helleren, Erik <[email protected]> wrote:
>> >> >
>> >> >> Well… not to be contrarian, but latency depends much more on the
>> >> >> latency between the producer and the broker that is the leader for
>> >> >> the partition you are publishing to. At least when your brokers are
>> >> >> not saturated with messages, and acks are set to 1. If acks are set
>> >> >> to ALL, latency on a non-saturated Kafka cluster will be: Round Trip
>> >> >> Latency from producer to leader for partition + Max(Round Trip
>> >> >> Latency to a replica of that partition). If a cluster is saturated
>> >> >> with messages, we have to assume that all partitions receive an
>> >> >> equal distribution of messages to avoid linear algebra and queueing
>> >> >> theory models. I don’t like linear algebra :P
>> >> >>
>> >> >> Since you are probably putting all your latencies into a single
>> >> >> histogram per producer, or worse, just an average, this pattern
>> >> >> would have been obscured. Obligatory lecture about measuring latency
>> >> >> by Gil Tene (https://www.youtube.com/watch?v=9MKY4KypBzg). To verify
>> >> >> this hypothesis, you should rewrite the benchmark to record the
>> >> >> latencies of each write to a partition, for each producer, in a
>> >> >> histogram (HdrHistogram is pretty good for that). This would give
>> >> >> you producers*partitions histograms, which might be unwieldy for
>> >> >> that many producers. But wait, there is hope!
>> >> >>
>> >> >> To verify that this hypothesis holds, you just have to see that
>> >> >> there is a significant difference between different partitions on a
>> >> >> SINGLE producing client. So, pick one producing client at random and
>> >> >> use the data from that.
>> >> >> The easy way to do that is to plot all the partition latency
>> >> >> histograms on top of each other in the same plot; that way you have
>> >> >> a pretty plot to show people. If you don’t want to set up plotting,
>> >> >> you can just compare the medians (50th percentile) of the
>> >> >> partitions’ histograms. If there is a lot of variance, your latency
>> >> >> anomaly is explained by brokers 4-7 being slower than nodes 0-3! If
>> >> >> there isn’t a lot of variance at the 50th percentile, look at higher
>> >> >> percentiles. And if the higher percentiles for all the partitions
>> >> >> look the same, this hypothesis is disproved.
>> >> >>
>> >> >> If you want to make a general statement about the latency of writing
>> >> >> to Kafka, you can merge all the histograms into a single histogram
>> >> >> and plot that.
>> >> >>
>> >> >> To Yuheng’s credit, more brokers always result in more throughput.
>> >> >> But throughput and latency are two different creatures. It's worth
>> >> >> noting that Kafka is designed to be high throughput first and low
>> >> >> latency second. And it does a really good job at both.
>> >> >>
>> >> >> Disclaimer: I might not like linear algebra, but I do like
>> >> >> statistics. Let me know if there are topics above that need more
>> >> >> explanation and aren’t covered by Gil’s lecture.
>> >> >> -Erik
>> >> >>
>> >> >> On 9/4/15, 9:03 AM, "Yuheng Du" <[email protected]> wrote:
>> >> >>
>> >> >> >When I use 32 partitions, the 4-broker latency becomes larger than
>> >> >> >the 8-broker latency.
>> >> >> >
>> >> >> >So is it always true that using more brokers gives lower latency
>> >> >> >when the number of partitions is at least the number of brokers?
>> >> >> >
>> >> >> >Thanks.
>> >> >> >
>> >> >> >On Thu, Sep 3, 2015 at 10:45 PM, Yuheng Du <[email protected]> wrote:
>> >> >> >
>> >> >> >> I am running a producer latency test. When using 92 producers on
>> >> >> >> 92 physical nodes publishing to 4 brokers, the latency is slightly
>> >> >> >> lower than when using 8 brokers. I am using 8 partitions for the
>> >> >> >> topic.
>> >> >> >>
>> >> >> >> I have rerun the test and it gives me the same result: the
>> >> >> >> 4-broker scenario still has lower latency than the 8-broker
>> >> >> >> scenario.
>> >> >> >>
>> >> >> >> It is weird because I tested 1 broker, 2 brokers, 4 brokers, 8
>> >> >> >> brokers, 16 brokers and 32 brokers. For the rest of the cases the
>> >> >> >> latency decreases as the number of brokers increases.
>> >> >> >>
>> >> >> >> 4 brokers/8 brokers is the only pair that doesn't satisfy this
>> >> >> >> rule. What could be the cause?
>> >> >> >>
>> >> >> >> I am using a 200-byte message, and the test has each producer
>> >> >> >> publish 500k messages to a given topic. For every test run where I
>> >> >> >> change the number of brokers, I use a new topic.
>> >> >> >>
>> >> >> >> Thanks for any advice.
