How should I measure the time spent in various stages of kafka

张瑞 Sun, 24 May 2020 12:53:31 -0700

Hi，we have some questions about Kafka and want to ask you, hope this email does 
not disturb your life. We are doing some work based on kakfa, and we measured 
the time spent at each stage in the workflow of Kafka when producing and 
consuming messages. The results show that network transmission time occupies 
most of RTT time, up to more than 80%. We are puzzled by the results, so we 
would like to ask you if there are errors in our test methods. Next is our 
experiment environment and test method. We will show the test results at the 
end.


Enviroment

We conduct experiments on two servers, one for broker and the other for 
producer of consumer. A 100Gbps switch is used for connection them. The Kafka 
version we use is 2.4.0 release, and can be downloaded from 
https://github.com/apache/kafka/releases/tag/2.4.0.

Method

We want to test the time of kafka in various stages when producing messages and 
consuming messages. The following are our relevant changes based on the source 
code.

Timestamps for producing messages

At the producer side, we mark a nano timestamp at 
./clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java 
before client.send(clientRequest, now); in function private void 
sendProduceRequest, to get the timestamp of the client sending the production 
message request. And mark a nano timestap at the same function before the line 
of handleProduceResponse(response, recordsByPartition, time.milliseconds()); in 
the same function, to get the timestamp of the client receiving response. We 
name the difference between these two timestamps as RTT.

At the broker side, we mark a nano timestamp at 
./core/src/main/scala/kafka/server/KafkaRequestHandler.scala before 
apis.handle(request) in function def run(), to get the timestamp of the broker 
receiving request from clients. And mark a timestamp at 
./core/src/main/scala/kafka/network/RequestChannel.scala before val processor = 
processors.get(response.processor) in function def sendResponse(response: 
RequestChannel.Response), to get the timestamp of the broker sending respone to 
clients. We name the difference between these two timestamps as Processing time 
in broker.

What's more, we add timestamps before and after val localProduceResults = 
appendToLocalLog(...) at 
./core/src/main/scala/kafka/server/ReplicaManager.scala in function def 
appendRecords() , to get the time for writing to local log. We name the 
difference between these two timestamps as Time for writing.

Timestamps for fetching messages

At the consumer side, we mark a nano timestamp at 
./clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java
 before RequestFuture<ClientResponse> future = client.send(fetchTarget, 
request); in function public synchronized int sendFetches(), to get the 
timestamp of the client sending the fetch request. And mark a nano timestap at 
the same function under the line of public void onSuccess(ClientResponse resp) 
{ in the same function, to get the timestamp of the client receiving response. 
We name the difference between these two timestamps as RTT.

We also add timestamps before and after val logReadResults = readFromLog() at 
./core/src/main/scala/kafka/server/ReplicaManager.scala in function 
fetchMessages , to get the time for reading from local log. We name the 
difference between these two timestamps as Time for reading.

Result

Here is the result for producing messages. Columns A and B are the independent 
variables we set. Columns C and D are measured from the broker, as we metioned 
in Timestamps for producing messages. The data in column E is obtained by 
dividing the corresponding column D data by column C, showing the proportion of 
time in writing to local log in the entire broker processing process. Column F 
is measured from the producer, as we metioned in Timestamps for producing 
messages, indicating the time from sending request to receiving response . The 
data in column G is the data in column F minus the data in column C, indicating 
the time it takes for data to be sent and received on the network.The data in 
column H is the column G divided by column F, indicating the proportion of data 
transmission time to RTT.

And the recorded results of consuming messages are shown in the figure below. 
You can see that there is a negative number, which is not realistic. But other 
data seems normal.

Question

Why is the network transmission time so high? Is it a problem with our 
timestamp design? Due to the existence of the NIO model, our test method may be 
inaccurate. How should we measure the time distribution of each stage of Kafka, 
such as the proportion of the processing time on the broker side and the 
proportion of the network transmission time? Can you give us some advice? I 
wish you a happy work, thank you!

How should I measure the time spent in various stages of kafka

Reply via email to