Haruki Okada created KAFKA-10690:
------------------------------------

             Summary: Produce-response delay caused by lagging replica fetch which blocks in-sync one
                 Key: KAFKA-10690
                 URL: https://issues.apache.org/jira/browse/KAFKA-10690
             Project: Kafka
          Issue Type: Improvement
          Components: core
    Affects Versions: 2.4.1
            Reporter: Haruki Okada
         Attachments: image-2020-11-06-11-15-21-781.png, image-2020-11-06-11-15-38-390.png, image-2020-11-06-11-17-09-910.png

h2. Our environment
 * Kafka version: 2.4.1

h2. Phenomenon
 * The 99th-percentile produce response time (remote scope) degraded to 500ms, which is 20 times worse than usual
 ** Meanwhile, the cluster was running a replica reassignment to service in a new machine, in order to recover the replicas that had been held by a broker machine that failed (hardware issue)

!image-2020-11-06-11-15-21-781.png|width=292,height=166!

h2. Analysis

Let's say
 * broker-X: The broker on which we observed the produce latency degradation
 * broker-Y: The broker being serviced in

broker-Y was catching up replicas of these partitions:
 * partition-A: has a relatively small log size
 * partition-B: has a large log size

(Actually, broker-Y was catching up many other partitions. I note only two partitions here to keep the explanation simple.)

broker-X was the leader for both partition-A and partition-B.

We found that both partition-A and partition-B were assigned to the same ReplicaFetcherThread of broker-Y, and that produce latency started to degrade right after broker-Y finished catching up partition-A.

!image-2020-11-06-11-17-09-910.png|width=476,height=174!

Besides, we observed disk reads on broker-X during the service-in. (This is natural, since old segments are likely not in the page cache.)

!image-2020-11-06-11-15-38-390.png|width=292,height=193!

So we suspected that:
 * The in-sync replica fetch (partition-A) was held up by the lagging replica fetch (partition-B), which is expected to be slow because it causes actual disk reads
 ** Since ReplicaFetcherThread sends fetch requests in a blocking manner, the next fetch request can't be sent until the current one completes
 ** => This delays the in-sync replica fetch for partitions assigned to the same replica fetcher thread
 ** => This degrades remote-scope produce latency

h2. Possible fix

We think this issue can be addressed by designating some of the ReplicaFetcherThreads (or creating another thread pool) for catching up lagging replicas, but we are not sure this is the appropriate way.

Please give your opinions about this issue.

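To make the suspected mechanism and the proposed fix concrete, here is a rough sketch of the reasoning as a tiny Java model. It is not Kafka code: the class {{FetcherIsolationSketch}}, the {{Replica}} type, and the read costs (2ms for a page-cache read, 500ms for a disk read) are hypothetical values chosen only for illustration. It also assumes, as a simplification, that one blocking fetch round costs the sum of the leader-side reads it contains.

```java
import java.util.ArrayList;
import java.util.List;

public class FetcherIsolationSketch {

    /** Hypothetical description of one follower replica on broker-Y. */
    static final class Replica {
        final String partition;
        final long leaderReadMillis; // assumed cost for broker-X to read this partition's data
        final boolean lagging;       // true while the replica is still catching up

        Replica(String partition, long leaderReadMillis, boolean lagging) {
            this.partition = partition;
            this.leaderReadMillis = leaderReadMillis;
            this.lagging = lagging;
        }
    }

    /**
     * Modeling assumption: a blocking fetcher cannot start its next round
     * until the current one completes, so a round costs the sum of its reads.
     */
    static long roundMillis(List<Replica> assigned) {
        long total = 0;
        for (Replica r : assigned) {
            total += r.leaderReadMillis;
        }
        return total;
    }

    /** Round time seen by in-sync replicas when every replica shares one fetcher. */
    static long sharedFetcherRound(List<Replica> replicas) {
        return roundMillis(replicas);
    }

    /** Round time seen by in-sync replicas when lagging replicas use a separate fetcher. */
    static long isolatedInSyncRound(List<Replica> replicas) {
        List<Replica> inSync = new ArrayList<>();
        for (Replica r : replicas) {
            if (!r.lagging) {
                inSync.add(r);
            }
        }
        return roundMillis(inSync);
    }

    public static void main(String[] args) {
        List<Replica> replicas = List.of(
                new Replica("partition-A", 2, false),   // caught up, served from page cache
                new Replica("partition-B", 500, true)); // still lagging, served from disk
        System.out.println("shared fetcher round:   " + sharedFetcherRound(replicas) + " ms");
        System.out.println("isolated in-sync round: " + isolatedInSyncRound(replicas) + " ms");
    }
}
```

With these made-up numbers, the shared fetcher gives partition-A a 502ms round while the isolated one gives it 2ms. The real costs depend on disk and request behavior that this model does not capture; it only shows why moving lagging replicas off the shared thread would be expected to help.
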
--
This message was sent by Atlassian Jira
(v8.3.4#803005)