Hello Raman, since you are using Consumer and you are concerning about the
member-failure triggered rebalance, I think KIP-429 is most relevant to
your scenario. As Matthias mentioned we are working on getting it in to the
next release 2.4.


Guozhang

On Sat, Jul 20, 2019 at 6:36 PM Matthias J. Sax <matth...@confluent.io>
wrote:

> Static-Group membership ships with AK 2.3 (the open tickets of the KIP
> are minor):
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-345%3A+Introduce+static+membership+protocol+to+reduce+consumer+rebalances
>
> There is also KIP-415 for Kafka Connect in AK 2.3:
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-415%3A+Incremental+Cooperative+Rebalancing+in+Kafka+Connect
>
>
>
> Currently WIP is KIP-429 and KIP-441:
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-429%3A+Kafka+Consumer+Incremental+Rebalance+Protocol
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-441%3A+Smooth+Scaling+Out+for+Kafka+Streams
>
>
>
> On 7/19/19 12:31 PM, Jeff Widman wrote:
> > I am also interested in learning how others are handling this.
> >
> > I also support several services where average message processing time
> takes
> > 20 seconds per message but p99 time is about 20 minutes and the
> > stop-the-world rebalancing is very painful
> >
> > On Fri, Jul 19, 2019, 11:38 AM Raman Gupta <rocketra...@gmail.com>
> wrote:
> >
> >> I've found
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing:+Support+and+Policies
> >> and
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing+for+Streams
> >> .
> >> This is *exactly* what I need, right down to the Kubernetes pod
> >> restart case. The number of issues with the current approach to
> >> rebalancing elucidated in these documents is downright scary, and now
> >> I am not surprised I am having tonnes of issues.
> >>
> >> Are there any plans to start implementing delayed imbalance and
> >> standby bootstrap?
> >>
> >> Are there any short-term best practices that can help alleviate these
> >> issues? My main problem right now is the "Instance Bounce" and
> >> "Instance Failover" scenarios, and according to this wiki page,
> >> num.standby.replicas should help with at least the former. Can someone
> >> explain what this does?
> >>
> >> Regards,
> >> Raman
> >>
> >> On Fri, Jul 19, 2019 at 12:53 PM Raman Gupta <rocketra...@gmail.com>
> >> wrote:
> >>>
> >>> I have a situation in which the current rebalancing algorithm seems to
> >>> be extremely sub-optimal.
> >>>
> >>> I have a topic with 100 partitions, and up to 100 separate consumers.
> >>> Processing each message on this topic takes between 1 and 20 minutes,
> >>> depending on the message.
> >>>
> >>> If any of the 100 consumers dies or drops out of the group, there is a
> >>> huge amount of idle time as many consumers (up to 99 of them) finish
> >>> their work and sit around idle, just waiting for the rebalance to
> >>> complete.
> >>>
> >>> In addition, with 100 consumers, its not unusual for one to die for
> >>> one reason or another, so these stop-the-world rebalances are
> >>> happening all the time, making the entire system slow to a snail's
> >>> pace.
> >>>
> >>> It surprises me that rebalance is so inefficient. I would have thought
> >>> that partitions would just be assigned/unassigned to consumers in
> >>> real-time without waiting for the entire consumer group to quiesce.
> >>>
> >>> Is there anything I can do to improve matters?
> >>>
> >>> Regards,
> >>> Raman
> >>
> >
>
>

-- 
-- Guozhang

Reply via email to