Re: some producers stuck when one broker is bad

Steven Wu Fri, 11 Sep 2015 23:16:22 -0700

I was doing a rolling bounce of all brokers. Immediately after the bad
broker was bounced, the stuck producers recovered.
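
The buffer-full stall described in the quoted explanation below can be modeled in a few lines. This is a toy sketch using Python's standard `queue` module, not the Kafka producer itself; `send_all` and `timeout_s` are hypothetical names chosen for illustration. With no timeout, `put` on a full queue blocks indefinitely, which mirrors the stuck producers; a finite timeout turns the stall into a visible error.

```python
import queue

def send_all(records, buffer, timeout_s=None):
    """Enqueue records into a bounded buffer; return (accepted, rejected).

    timeout_s=None would block forever once the buffer is full (the
    stuck-producer case); a finite timeout surfaces the failure instead.
    """
    accepted, rejected = 0, 0
    for record in records:
        try:
            buffer.put(record, block=True, timeout=timeout_s)
            accepted += 1
        except queue.Full:
            # The buffer stayed full for the whole timeout: report, don't hang.
            rejected += 1
    return accepted, rejected

# Nothing drains this buffer, mimicking a broker that is up but unresponsive.
buffer = queue.Queue(maxsize=3)
accepted, rejected = send_all(range(5), buffer, timeout_s=0.01)
print(accepted, rejected)
```

Per the thread, KAFKA-2120 is the change expected to give the producer this kind of timeout.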


On Fri, Sep 11, 2015 at 9:05 AM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:

> So how did you detect that the broker was bad? If bouncing the brokers solved
> the problem and you did not find anything unusual in the broker logs, it is
> likely that the process was up but isolated from producer requests, and
> since the producer had no timeout, the producer buffer filled up.
>
> Thanks,
>
> Mayuresh
>
>
> On Thu, Sep 10, 2015 at 11:20 PM, Steven Wu <stevenz...@gmail.com> wrote:
>
> > Frankly, I don't know exactly what went wrong with that broker. The process
> > is still up.
> >
> > On Wed, Sep 9, 2015 at 10:10 AM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:
> >
> > > 1) any suggestion on how to identify the bad broker(s)?
> > > ---> At LinkedIn we have alerts, set up using our internal scripts, for
> > > detecting whether a broker has gone bad. We also check the
> > > under-replicated partitions, which can tell us which broker has gone bad.
> > > A broker "going bad" can mean different things: the broker is alive but
> > > not responding and is completely isolated, the broker has gone down, etc.
> > > Can you tell us what you meant when you said your broker went bad?
> > >
> > > 2) why bouncing of the bad broker got the producers recovered automatically
> > > ----> This is because, as you bounced the brokers, the partition leaders
> > > changed. The producer sent out a TopicMetadataRequest, which told it which
> > > brokers were the new leaders for the partitions, and it started sending
> > > messages to those brokers.
> > >
> > > KAFKA-2120 will handle all of this for you automatically.
> > >
> > > Thanks,
> > >
> > > Mayuresh
> > >
> > > On Tue, Sep 8, 2015 at 8:26 PM, Steven Wu <stevenz...@gmail.com> wrote:
> > >
> > > > We have observed that some producer instances stopped sending traffic to
> > > > the brokers because their memory buffers were full. Those producers were
> > > > stuck in this state permanently. Because we couldn't find out which
> > > > broker was bad, I did a rolling restart of all brokers. After the bad
> > > > broker was bounced, the stuck producers recovered automatically.
> > > >
> > > > I don't know the exact problem with that bad broker. It seems to me that
> > > > some ZK state was inconsistent.
> > > >
> > > > I know the timeout fix from KAFKA-2120 can probably prevent producers
> > > > from getting stuck permanently. Here are some additional questions.
> > > > 1) any suggestion on how to identify the bad broker(s)?
> > > > 2) why bouncing of the bad broker got the producers recovered automatically
> > > > (without restarting producers)
> > > >
> > > > producer: 0.8.2.1
> > > > broker: 0.8.2.1
> > > >
> > > > Thanks,
> > > > Steven
> > > >
> > >
> > >
> > >
> > > --
> > > -Regards,
> > > Mayuresh R. Gharat
> > > (862) 250-7125
> > >
> >
>
>
>
> --
> -Regards,
> Mayuresh R. Gharat
> (862) 250-7125
>
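
The under-replicated-partitions check suggested in the thread can be sketched roughly as follows. This assumes tab-separated partition lines with `Replicas:` and `Isr:` fields, resembling what `kafka-topics.sh --describe` prints; the sample text, topic names, and the helper `missing_from_isr` are hypothetical. The idea is to count, per broker id, how often it is an assigned replica but absent from the in-sync set.

```python
from collections import Counter

def missing_from_isr(describe_output):
    """Count, per broker id, the partitions listing it in Replicas but not in Isr."""
    counts = Counter()
    for line in describe_output.splitlines():
        fields = {}
        for part in line.strip().split("\t"):
            key, sep, value = part.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "Replicas" not in fields or "Isr" not in fields:
            continue  # not a partition line
        replicas = set(fields["Replicas"].split(","))
        isr = set(fields["Isr"].split(",")) if fields["Isr"] else set()
        for broker in replicas - isr:
            counts[broker] += 1
    return counts

# Hypothetical sample: broker 2 has dropped out of the ISR for two partitions.
SAMPLE = (
    "\tTopic: events\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,3\n"
    "\tTopic: events\tPartition: 1\tLeader: 3\tReplicas: 2,3,1\tIsr: 3,1\n"
    "\tTopic: logs\tPartition: 0\tLeader: 1\tReplicas: 1,3\tIsr: 1,3\n"
)
suspects = missing_from_isr(SAMPLE)
print(suspects.most_common(1))
```

A broker that is up and still replicating but not serving producer requests, as may have been the case here, would not necessarily show up in this count, so it is only one signal among several.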
