Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

2021-06-22 Thread Satish Duggana
Hi Jun, I updated KIP-501 with more details. Please take a look and provide your comments. This issue occurred several times in multiple production environments (Uber, Yelp, Twitter, etc.). Thanks, Satish. On Thu, 13 Feb 2020 at 17:04, Satish Duggana wrote: > > Hi Lucas, > Thanks for

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

2020-08-18 Thread Flavien Raynaud
Hi there, Just a small nudge on this as this happened once more at Yelp. Was there any progress on this? If not, is there anything we can do to help? Thank you, Flavien On 2020/02/13 11:34:14, Satish Duggana wrote: > Hi Lucas, > Thanks for looking into the KIP and providing your comments.

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

2020-02-13 Thread Satish Duggana
Hi Lucas, Thanks for looking into the KIP and providing your comments. Adding to what Harsha mentioned, I do not think there is a foolproof solution here for cases like pending requests in the request queue. We also thought about the option of relinquishing the leadership but the

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

2020-02-11 Thread Harsha Chintalapani
Hi Lucas, Yes, the case you mentioned is true. I do understand KIP-501 might not fully solve this particular use case where there might be blocked fetch requests. But the issue we noticed multiple times and continue to notice is 1. Fetch request comes from Follower 2.

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

2020-02-10 Thread Lucas Bradstreet
Hi Harsha, Is the problem you'd like addressed the following? Assume 3 replicas, L and F1 and F2. 1. F1 and F2 are alive and sending fetch requests to L. 2. L starts encountering disk issues, and any requests being processed by the request handler threads become blocked. 3. L's ZooKeeper connection

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

2020-02-10 Thread Harsha Ch
Hi Jason & Jun, Do you have any feedback on the KIP, or is it OK to take it to a vote? It's good to have this config in Kafka to address disk failure scenarios as described in the KIP. Thanks, Harsha On Mon, Feb 10, 2020 at 5:10 PM, Brian Sang < bais...@yelp.com.invalid > wrote:

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

2020-02-10 Thread Brian Sang
Hi, Just wanted to bump this discussion, since it happened to us again at Yelp. It's particularly nasty since it can happen right before a disk failure, so right as the leader for the partition becomes the only ISR, the leader becomes unrecoverable right after, forcing us to do an unclean

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

2020-01-21 Thread Satish Duggana
Hi Jun, Can you please review the KIP and let us know your comments? If there are no comments/questions, we can start a vote thread. It looks like Yelp folks also encountered the same issue as mentioned in JIRA comment[1]. >> Flavien Raynaud added a comment - Yesterday We've seen offline

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

2019-12-04 Thread Harsha Chintalapani
Hi Jason, As Satish said, just increasing replica max lag will not work in this case. Just before a disk dies, reads become really slow and it's hard to estimate how much this is, as we noticed the range is pretty wide. Overall it doesn't make sense to knock good replicas out of just because

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

2019-11-18 Thread Satish Duggana
Hi Jason, Thanks for looking into the KIP. Apologies for my late reply. Increasing replica max lag to 30-45 secs did not help as we observed that a few fetch requests took more than 1-2 minutes. We do not want to increase further as it increases upper bound on commit latency. We have strict SLAs

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

2019-11-13 Thread Jason Gustafson
Hi Satish, Thanks for the KIP. I'm wondering how much of this problem can be addressed just by increasing the replication max lag? That was one of the purposes of KIP-537 (the default increased from 10s to 30s). Also, the new configurations seem quite low level. I think they will be hard for

Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

2019-11-06 Thread Satish Duggana
Hi Dhruvil, Thanks for looking into the KIP. 10. I have an initial sketch of KIP-501 in commit[a] which discusses tracking the pending fetch requests. Tracking is not done in Partition#readRecords because if it takes longer in reading any of the partitions then we do not want any of the

[DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests not processed in time

2019-10-28 Thread Satish Duggana
Hi All, I wrote a short KIP about avoiding out-of-sync or offline partitions when follower fetch requests are not processed in time by the leader replica. KIP-501 is located at https://s.apache.org/jhbpn Please take a look, I would like to hear your feedback and suggestions. JIRA: