Hello Raman,

Since you are using the Consumer and you are concerned about rebalances triggered by member failure, I think KIP-429 is the most relevant to your scenario. As Matthias mentioned, we are working on getting it into the next release, 2.4.
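Until KIP-429 lands, the static membership that Matthias mentions below (KIP-345, AK 2.3) can be sketched as plain consumer settings. This is only a minimal sketch under assumptions: the group name, broker address, instance id and timeout values are hypothetical, and the session timeout you pick must fall within the broker's `group.max.session.timeout.ms`.

```java
import java.util.Properties;

public class StaticMemberConfigSketch {

    // Builds consumer settings for one worker. All values are illustrative.
    static Properties consumerProps(String instanceId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("group.id", "slow-workers");            // hypothetical group name

        // KIP-345 static membership: a stable id per instance (e.g. the pod name).
        props.put("group.instance.id", instanceId);

        // A static member that comes back within the session timeout keeps its
        // partitions without forcing a group-wide rebalance. 2 minutes is a guess.
        props.put("session.timeout.ms", "120000");

        // Must exceed the slowest message (up to 20 minutes in this thread);
        // the client default is 5 minutes.
        props.put("max.poll.interval.ms", String.valueOf(30 * 60 * 1000));

        // One record per poll keeps the time between polls close to one message.
        props.put("max.poll.records", "1");
        return props;
    }

    public static void main(String[] args) {
        Properties props = consumerProps("worker-7");
        props.stringPropertyNames().stream().sorted()
             .forEach(k -> System.out.println(k + "=" + props.getProperty(k)));
    }
}
```

The resulting `Properties` would be passed to a `KafkaConsumer` constructor together with the usual deserializer settings, which are omitted here.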
Guozhang

On Sat, Jul 20, 2019 at 6:36 PM Matthias J. Sax <matth...@confluent.io> wrote:

> Static-Group membership ships with AK 2.3 (the open tickets of the KIP
> are minor):
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-345%3A+Introduce+static+membership+protocol+to+reduce+consumer+rebalances
>
> There is also KIP-415 for Kafka Connect in AK 2.3:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-415%3A+Incremental+Cooperative+Rebalancing+in+Kafka+Connect
>
> Currently WIP is KIP-429 and KIP-441:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-429%3A+Kafka+Consumer+Incremental+Rebalance+Protocol
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-441%3A+Smooth+Scaling+Out+for+Kafka+Streams
>
> On 7/19/19 12:31 PM, Jeff Widman wrote:
> > I am also interested in learning how others are handling this.
> >
> > I also support several services where average message processing time takes
> > 20 seconds per message but p99 time is about 20 minutes and the
> > stop-the-world rebalancing is very painful
> >
> > On Fri, Jul 19, 2019, 11:38 AM Raman Gupta <rocketra...@gmail.com> wrote:
> >
> >> I've found
> >> https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing:+Support+and+Policies
> >> and
> >> https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing+for+Streams
> >> .
> >> This is *exactly* what I need, right down to the Kubernetes pod
> >> restart case. The number of issues with the current approach to
> >> rebalancing elucidated in these documents is downright scary, and now
> >> I am not surprised I am having tonnes of issues.
> >>
> >> Are there any plans to start implementing delayed imbalance and
> >> standby bootstrap?
> >>
> >> Are there any short-term best practices that can help alleviate these
> >> issues? My main problem right now is the "Instance Bounce" and
> >> "Instance Failover" scenarios, and according to this wiki page,
> >> num.standby.replicas should help with at least the former. Can someone
> >> explain what this does?
> >>
> >> Regards,
> >> Raman
> >>
> >> On Fri, Jul 19, 2019 at 12:53 PM Raman Gupta <rocketra...@gmail.com> wrote:
> >>>
> >>> I have a situation in which the current rebalancing algorithm seems to
> >>> be extremely sub-optimal.
> >>>
> >>> I have a topic with 100 partitions, and up to 100 separate consumers.
> >>> Processing each message on this topic takes between 1 and 20 minutes,
> >>> depending on the message.
> >>>
> >>> If any of the 100 consumers dies or drops out of the group, there is a
> >>> huge amount of idle time as many consumers (up to 99 of them) finish
> >>> their work and sit around idle, just waiting for the rebalance to
> >>> complete.
> >>>
> >>> In addition, with 100 consumers, its not unusual for one to die for
> >>> one reason or another, so these stop-the-world rebalances are
> >>> happening all the time, making the entire system slow to a snail's
> >>> pace.
> >>>
> >>> It surprises me that rebalance is so inefficient. I would have thought
> >>> that partitions would just be assigned/unassigned to consumers in
> >>> real-time without waiting for the entire consumer group to quiesce.
> >>>
> >>> Is there anything I can do to improve matters?
> >>>
> >>> Regards,
> >>> Raman
> >>
> >
>

--
-- Guozhang