Hi Jun,
I updated KIP-501 with more details. Please take a look and
provide your comments.
This issue has occurred several times in multiple production
environments (Uber, Yelp, Twitter, etc.).
Thanks,
Satish.
On Thu, 13 Feb 2020 at 17:04, Satish Duggana wrote:
Hi there,
Just a small nudge on this, as this happened once more at Yelp.
Was there any progress on this? If not, is there anything we can do to help?
Thank you,
Flavien
On 2020/02/13 11:34:14, Satish Duggana wrote:
Hi Lucas,
Thanks for looking into the KIP and providing your comments.
Adding to what Harsha mentioned, I do not think there is a foolproof
solution here to solve cases like pending requests in the request
queue. We also thought about the option of relinquishing the
leadership, but the
Hi Lucas,
Yes, the case you mentioned is true. I understand KIP-501
might not fully solve this particular use case where there might be blocked
fetch requests. But the issue we noticed multiple times, and continue to
notice, is:
1. A fetch request comes from the follower.
2.
Hi Harsha,
Is the problem you'd like addressed the following?
Assume 3 replicas: L, F1, and F2.
1. F1 and F2 are alive and sending fetch requests to L.
2. L starts encountering disk issues, and any requests being processed by
the request handler threads become blocked.
3. L's zookeeper connection
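(For readers following the scenario: the leader-side ISR shrink check at the
heart of it works roughly as in the simplified sketch below. This is an
illustrative approximation only, not the actual Kafka code; the real broker
logic lives around kafka.cluster.Partition#maybeShrinkIsr, and the class and
field names here are invented.)

// Illustrative sketch only; not the actual Kafka implementation.
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class IsrShrinkSketch {
    // followerId -> last time (ms) the follower's fetch reached the leader's log end offset
    private final Map<Integer, Long> lastCaughtUpTimeMs;
    private final Set<Integer> isr;
    private final long replicaLagTimeMaxMs;  // replica.lag.time.max.ms

    IsrShrinkSketch(Map<Integer, Long> lastCaughtUpTimeMs, Set<Integer> isr, long replicaLagTimeMaxMs) {
        this.lastCaughtUpTimeMs = lastCaughtUpTimeMs;
        this.isr = isr;
        this.replicaLagTimeMaxMs = replicaLagTimeMaxMs;
    }

    // Periodic leader-side check: followers that have not caught up within
    // replica.lag.time.max.ms are dropped from the ISR. Nothing here distinguishes
    // "the follower is slow" from "the leader is too slow to serve the fetch",
    // which is exactly the gap described in this scenario.
    Set<Integer> followersToDrop(long nowMs) {
        Set<Integer> outOfSync = new HashSet<>();
        for (int followerId : isr) {
            long caughtUpMs = lastCaughtUpTimeMs.getOrDefault(followerId, 0L);
            if (nowMs - caughtUpMs > replicaLagTimeMaxMs) {
                outOfSync.add(followerId);
            }
        }
        return outOfSync;
    }
}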
Hi Jason & Jun,
Do you have any feedback on the KIP, or is it OK to take it to
a vote? It's good to have this config in Kafka to address disk failure
scenarios as described in the KIP.
Thanks,
Harsha
On Mon, Feb 10, 2020 at 5:10 PM, Brian Sang < bais...@yelp.com.invalid > wrote:
Hi,
Just wanted to bump this discussion, since it happened to us again at Yelp.
It's particularly nasty since it can happen right before a disk failure:
right as the leader for the partition becomes the only replica in the ISR,
the leader becomes unrecoverable, forcing us to do an unclean
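(For context, the "unclean" recovery being referred to here is unclean leader
election, i.e. letting a replica that is not in the ISR take over leadership
at the cost of possible data loss. The usual switch is the broker/topic config
shown below; this is purely illustrative, not a recommendation.)

# illustrative only: permit an out-of-sync replica to become leader (risks data loss)
unclean.leader.election.enable=true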
Hi Jun,
Can you please review the KIP and let us know your comments?
If there are no comments/questions, we can start a vote thread.
It looks like Yelp folks also encountered the same issue as mentioned
in JIRA comment[1].
>> Flavien Raynaud added a comment - Yesterday
We've seen offline
Hi Jason,
As Satish said, just increasing the replica max lag will not work in this
case. Just before a disk dies, reads become really slow and it's hard to
estimate by how much, as we noticed the range is pretty wide. Overall it
doesn't make sense to knock good replicas out of the ISR just because
Hi Jason,
Thanks for looking into the KIP. Apologies for my late reply.
Increasing the replica max lag to 30-45 seconds did not help, as we observed
that a few fetch requests took more than 1-2 minutes. We do not want
to increase it further as that raises the upper bound on commit latency. We
have strict SLAs
Hi Satish,
Thanks for the KIP. I'm wondering how much of this problem can be
addressed just by increasing the replication max lag? That was one of the
purposes of KIP-537 (the default increased from 10s to 30s). Also, the new
configurations seem quite low level. I think they will be hard for
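(For reference, the broker setting under discussion is replica.lag.time.max.ms;
as noted above, KIP-537 raised its default from 10 s to 30 s. Below is a
server.properties illustration of the kind of increase tried earlier in the
thread; the 45000 ms value is only an example from the 30-45 s range mentioned,
not a recommendation.)

# server.properties (illustrative value only)
replica.lag.time.max.ms=45000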
Hi Dhruvil,
Thanks for looking into the KIP.
10. I have an initial sketch of KIP-501 in commit[a], which
discusses tracking the pending fetch requests. Tracking is not done in
Partition#readRecords because, if reading any of the partitions takes
longer, we do not want any of the
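(To make the idea of tracking pending fetch requests concrete, here is a rough,
hypothetical sketch. The class and method names are invented for illustration;
the actual tracking point and configuration are defined by KIP-501 itself.)

// Hypothetical illustration of per-follower pending-fetch bookkeeping on the leader.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PendingFetchTracker {
    // followerId -> arrival time (ms) of its oldest still-unanswered fetch request
    private final Map<Integer, Long> oldestPendingFetchMs = new ConcurrentHashMap<>();

    void onFetchReceived(int followerId, long nowMs) {
        oldestPendingFetchMs.putIfAbsent(followerId, nowMs);
    }

    void onFetchCompleted(int followerId) {
        oldestPendingFetchMs.remove(followerId);
    }

    // Before shrinking the ISR, the leader could skip a follower whose apparent lag
    // is explained by a fetch request the leader itself has been holding too long.
    boolean leaderIsTheBottleneck(int followerId, long nowMs, long maxPendingMs) {
        Long pendingSinceMs = oldestPendingFetchMs.get(followerId);
        return pendingSinceMs != null && nowMs - pendingSinceMs >= maxPendingMs;
    }
}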
Hi All,
I wrote a short KIP about avoiding out-of-sync or offline partitions
when follower fetch requests are not processed in time by the leader
replica.
KIP-501 is located at https://s.apache.org/jhbpn
Please take a look; I would like to hear your feedback and suggestions.
JIRA: