[ https://issues.apache.org/jira/browse/KAFKA-7152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552105#comment-16552105 ]

Dong Lin edited comment on KAFKA-7152 at 7/22/18 6:06 PM:
----------------------------------------------------------

Thanks for the clarification [~ijuma]. Yeah, I understand the immediate result of 
this patch, i.e. the ISR won't shrink if a topic is not produced to often and a 
replica fetcher thread dies. But I am trying to understand the concern with 
this behavior.

Let's assume replica.lag.time.max.ms uses the default value of 10 sec. In this 
case maybeShrinkIsr() will be called every 5 sec. We can assume for now that 
everything else is fast and takes negligible time.

Regarding 1, you are right that for a ProduceRequest which is sent long after the 
thread dies, the produce delay (and end-to-end latency) will be higher if we 
don't proactively shrink the ISR. More specifically, if the ISR has already been 
shrunk, the leader won't wait for the problematic follower and the ProduceRequest 
can succeed with 0 sec delay, whereas if the ISR has not been shrunk yet, the 
ProduceRequest can succeed within 5 sec because the ISR will be shrunk the next 
time maybeShrinkIsr() is called.

But if we look at an arbitrary ProduceRequest sent by the user, this would include 
a ProduceRequest that is sent immediately after the thread dies. For such a 
ProduceRequest, the max produce delay would also be 5 sec. So the overall max 
produce delay (and end-to-end latency) does not change with this patch. And since 
only the first ProduceRequest may be affected, it probably does not affect the 
99th percentile and average produce delay. So this patch probably does not 
negatively affect the overall user experience regarding produce delay.
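The delay argument above can be sketched with a bit of arithmetic. This is an illustrative model only, not broker code; it assumes, as stated earlier in the thread, that maybeShrinkIsr() runs every replica.lag.time.max.ms / 2 with the default config:

```java
// Illustrative sketch of the worst-case produce-delay bound discussed above.
// Assumes maybeShrinkIsr() runs every replica.lag.time.max.ms / 2 (5 sec with
// the default 10-sec config), and the follower thread is already dead.
public class IsrShrinkDelay {
    static final long REPLICA_LAG_TIME_MAX_MS = 10_000;                // default
    static final long CHECK_INTERVAL_MS = REPLICA_LAG_TIME_MAX_MS / 2; // 5 sec

    // Extra wait for a ProduceRequest that arrives `msAfterLastCheck` ms after
    // the most recent maybeShrinkIsr() run: the leader only drops the dead
    // follower from the ISR at the next check, so the wait is bounded by the
    // check interval in both the old and the new behavior.
    static long worstCaseDelayMs(long msAfterLastCheck) {
        return CHECK_INTERVAL_MS - (msAfterLastCheck % CHECK_INTERVAL_MS);
    }

    public static void main(String[] args) {
        // A request sent right after a check waits the full interval...
        System.out.println(worstCaseDelayMs(0));    // prints 5000
        // ...and one sent just before the next check barely waits at all.
        System.out.println(worstCaseDelayMs(4999)); // prints 1
    }
}
```

Either way the bound is the check interval, which is why only the first ProduceRequest after the thread dies is affected.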

Also, I personally feel that the ReplicaFetcherThread will die only if there is a 
bug. In this case, it is probably not a big deal to have an extra 5 seconds of 
delay, given that there may be a bigger issue caused by the bug, the partition has 
been inactive for much longer than 5 seconds, and it only affects the first 
ProduceRequest.

 

Regarding 2, yes, we should be able to monitor whether the ReplicaFetcherThread 
is running. This can currently be achieved by monitoring the ISR, but that is 
probably not a good way to do it. In general, when the ISR shrinks, the system 
admin does not directly know whether it is because of lag or because the 
ReplicaFetcherThread died, which means it can take a long time before the system 
admin takes action. And it does not directly tell which broker has the issue.

 

In order to solve this problem, maybe it is better to add metrics in the 
broker to monitor NumOfflineThread, FetchRequestRate and 
FetchRequestLocalTime in the follower (not the leader). NumOfflineThread 
should increase whenever a thread has died abnormally in the broker, and the 
system admin can be alerted if this metric rises above 0. The other two metrics 
can be very useful in debugging replication performance issues in the follower.
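A minimal sketch of how such a NumOfflineThread metric could work, using only the JDK. The real broker would expose this as a metrics gauge and the wiring here (class and method names) is hypothetical; the point is just that an uncaught-exception handler on the fetcher thread can drive the counter:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the proposed NumOfflineThread metric. In a real
// broker this would be a metrics-gauge; here a plain AtomicInteger stands in.
public class OfflineThreadTracker {
    private static final AtomicInteger NUM_OFFLINE_THREADS = new AtomicInteger(0);

    // Value the system admin would alert on when it rises above 0.
    public static int numOfflineThreads() {
        return NUM_OFFLINE_THREADS.get();
    }

    // Called when a monitored thread dies abnormally.
    static void recordThreadDeath(Thread thread, Throwable error) {
        NUM_OFFLINE_THREADS.incrementAndGet();
    }

    // Wrap fetcher-style work so an abnormal death bumps the counter.
    public static Thread monitored(Runnable work, String name) {
        Thread t = new Thread(work, name);
        t.setUncaughtExceptionHandler(OfflineThreadTracker::recordThreadDeath);
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = monitored(() -> { throw new IllegalStateException("bug"); },
                             "replica-fetcher-sketch");
        t.start();
        t.join();
        System.out.println(numOfflineThreads()); // prints 1
    }
}
```

An alert on this counter would point directly at the broker with the dead thread, instead of forcing the admin to infer it from ISR churn.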

 

So it seems that in the short term this patch won't change the performance 
guarantee (e.g. max, average, 99th percentile produce delay) for the user, and it 
will only slightly increase the produce delay when there is a bug in Kafka. And in 
the long term we probably want to add more metrics in Kafka to detect such bugs 
and let the system admin proactively resolve the issue. Does this sound good?

 

 

 

 



> replica should be in-sync if its LEO equals leader's LEO
> --------------------------------------------------------
>
>                 Key: KAFKA-7152
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7152
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Dong Lin
>            Assignee: Zhanxiang (Patrick) Huang
>            Priority: Major
>             Fix For: 2.1.0
>
>
> Currently a replica will be moved out of the ISR if the follower has not fetched 
> from the leader for 10 sec (default replica.lag.time.max.ms). This causes a 
> problem in the following scenario:
> Say a follower's ReplicaFetcherThread needs to fetch 2k partitions from the 
> leader broker. Only 100 out of the 2k partitions are actively being produced to, 
> and therefore the total bytes-in rate for those 2k partitions is small. The 
> following will happen:
>  
> 1) The follower's ReplicaFetcherThread sends FetchRequest for those 2k 
> partitions.
> 2) Because the total bytes-in rate for those 2k partitions is very small, the 
> follower is able to catch up and the leader broker adds these 2k partitions to 
> the ISR. The follower's lastCaughtUpTimeMs for all partitions is updated to the 
> current time T0.
> 3) Since the follower has caught up on all 2k partitions, the leader updates 2k 
> partition znodes to include the follower in the ISR. It may take 20 seconds 
> to write 2k partition znodes if each znode write operation takes 10 ms.
> 4) At T0 + 15, maybeShrinkIsr() is invoked on the leader broker. Since there has 
> been no FetchRequest from the follower for more than 10 seconds after T0, all 
> those 2k partitions will be considered out of sync and the follower will be 
> removed from the ISR.
> 5) The follower receives the FetchResponse at least 20 seconds after T0. That 
> means the next FetchRequest from the follower to the leader will be after T0 + 20.
> The sequence of events described above will loop over time. There will be 
> constant churn of URPs in the cluster even if the follower can keep up with the 
> leader's bytes-in rate. This reduces the cluster availability.
>  
> In order to address this problem, one simple approach is to keep the follower in 
> the ISR as long as the follower's LEO equals the leader's LEO, regardless of the 
> follower's lastCaughtUpTimeMs. This is particularly useful if there are a lot 
> of inactive partitions in the cluster.
>  
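The proposed condition from the quoted description can be sketched as follows. The names are illustrative, not the actual broker code; the idea is simply that a follower is only eligible for removal from the ISR if it is behind the leader AND has not caught up within the lag window:

```java
// Illustrative sketch of the proposed ISR check (names are hypothetical, not
// the actual Kafka broker code).
public class IsrCheck {
    static boolean isOutOfSync(long followerLeo, long leaderLeo,
                               long lastCaughtUpTimeMs, long nowMs,
                               long maxLagMs) {
        // Proposed change: an idle but fully caught-up follower stays in the
        // ISR no matter how long ago its last fetch was.
        if (followerLeo == leaderLeo) {
            return false;
        }
        // Otherwise fall back to the existing lag-based rule.
        return nowMs - lastCaughtUpTimeMs > maxLagMs;
    }

    public static void main(String[] args) {
        // Caught-up follower on an inactive partition: stays in the ISR even
        // though its lastCaughtUpTimeMs is well past the 10-sec window.
        System.out.println(isOutOfSync(100, 100, 0, 20_000, 10_000)); // false
        // Lagging follower past the window: removed.
        System.out.println(isOutOfSync(90, 100, 0, 20_000, 10_000));  // true
    }
}
```

Under this rule the scenario in the description no longer churns, because the 2k caught-up partitions keep the follower in the ISR while it waits for the slow znode writes.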



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
