If you are on 0.8.1 or higher and are running with replication, consider disabling the forced log flush; it will definitely lead to latency spikes because the flush is synchronous. You will still get durability from replication and the background OS flush. On Linux, the background I/O flush that the OS does doesn't have much impact.
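
For illustration, disabling the forced flush roughly means leaving the broker's flush settings out of server.properties. The fragment below is only a sketch. It assumes a 0.8.1+ broker, where leaving these settings unset is understood to mean the broker never forces an fsync itself. The commented-out values are placeholders, and the defaults should be checked against the configuration docs for the version in use.

```properties
# server.properties (sketch; the commented-out values are illustrative placeholders)
# Leave both forced-flush settings unset so the broker does not fsync
# synchronously on the write path; durability then comes from replication
# plus the OS background flush.
#log.flush.interval.messages=10000
#log.flush.interval.ms=1000
```
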
Also, we fixed several significant latency-related bugs in 0.8.1 for the 0.8.2 release, so consider giving that a try.

Finally, Linux write performance is itself highly variable. Even in the absence of any synchronous flushing there is some locking around I/O operations like allocating new journal blocks. If you are running Linux, I think we include some tuning options in the ops section of the documentation that help reduce that. There is a test class, kafka.TestLinearWriteSpeed, which will benchmark the throughput and latency using either a plain file or a local Kafka log. It is worth doing this to get a baseline for how fast and variable things can be in the absence of any network or coordination.

-Jay

On Tue, Feb 3, 2015 at 1:37 AM, Xinyi Su <xiny...@gmail.com> wrote:

> Hi,
> I am building Kafka cluster and run producer perf test to get Kafka latency
> performance.
> From test result, I notice that the long tail latency is very high and
> increased with time passing by although the 99.9% result looks very good.
> The worst latency can reach more than 1 second. Besides, disk utilization
> is always very low, never more than 1%. I also try to tune
> log.flush.interval.ms from 1000ms to 200ms. It does not help much.
>
> Below is the max latency chart, Y axis represents the max latency in
> millisecond, X axis represents the time elapsed in milliseconds. From
> chart, we can see the latency increasing from about 10ms to 1095ms
> gradually.
>
> [image: Inline image]
>
> Kafka cluster is built up with 4 hosts. The version is 2.9.2-0.8.2-beta.
> The PerfTopic15 topic is created with 3 partition and 3 replication.
>
> Here is my perf script usage:
> -bash-4.1$ bin/kafka-producer-perf-test.sh --broker-list <broker list>
> --topics PerfTopic15 --sync --initial-message-id 1 --messages
> 200000 --csv-reporter-enabled --metrics-dir /tmp/PerfTopic15_1
> --message-send-gap-ms 20 --request-num-acks -1 --batch-size 1
>
> -bash-4.1$ bin/kafka-topics.sh --zookeeper <zkHost>:2181 --describe
> --topic PerfTopic15
> Topic:PerfTopic15 PartitionCount:3 ReplicationFactor:3 Configs:
> Topic: PerfTopic15 Partition: 0 Leader: 3 Replicas: 3,4,1 Isr: 3,4,1
> Topic: PerfTopic15 Partition: 1 Leader: 4 Replicas: 4,1,2 Isr: 4,1,2
> Topic: PerfTopic15 Partition: 2 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
>
> I expect the worst latency not exceed 100 milliseconds. But the test result
> is very discouraging. Do you have some points about Kafka long tail latency
> issue?
>
> Hope for your reply! Thanks in advance!
>