I was doing a rolling bounce of all brokers. Immediately after the bad broker was bounced, the stuck producers recovered.
On Fri, Sep 11, 2015 at 9:05 AM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:

> So how did you detect that the broker is bad? If bouncing brokers solved
> the problem and you did not find any unusual things in the logs on brokers,
> it is likely that the process was up but was isolated from producer
> request and since the producer did not have timeout the producer buffer
> filled up.
>
> Thanks,
>
> Mayuresh
>
>
> On Thu, Sep 10, 2015 at 11:20 PM, Steven Wu <stevenz...@gmail.com> wrote:
>
> > frankly I don't know exactly what went BAD for that broker. process is
> > still UP.
> >
> > On Wed, Sep 9, 2015 at 10:10 AM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:
> >
> > > 1) any suggestion on how to identify the bad broker(s)?
> > > ---> At Linkedin we have alerts that are setup using our internal scripts
> > > for detecting if a broker has gone bad. We also check the under replicated
> > > partitions and that can tell us which broker has gone bad. By broker going
> > > bad, it can mean different things. Like the broker is alive but not
> > > responding and is completely isolated or the broker has gone down, etc.
> > > Can you tell us what you meant by your BROKER went BAD?
> > >
> > > 2) why bouncing of the bad broker got the producers recovered automatically
> > > ----> This is because as you bounced, the leaders for other partitions
> > > changed and producer sent out a TopicMetadataRequest which tells the
> > > producer who are the new leaders for the partitions and the producer
> > > started sending messages to those brokers.
> > >
> > > KAFKA-2120 will handle all of this for you automatically.
> > >
> > > Thanks,
> > >
> > > Mayuresh
> > >
> > > On Tue, Sep 8, 2015 at 8:26 PM, Steven Wu <stevenz...@gmail.com> wrote:
> > >
> > > > We have observed that some producer instances stopped sending traffic to
> > > > brokers, because the memory buffer is full. those producers got stuck in
> > > > this state permanently. Because we couldn't find out which broker is bad
> > > > here. So I did a rolling restart the all brokers. after the bad broker got
> > > > bounce, those stuck producers out of the woods automatically.
> > > >
> > > > I don't know the exact problem with that bad broker. it seems to me that
> > > > some ZK states are inconsistent.
> > > >
> > > > I know timeout fix from KAFKA-2120 can probably avoid the permanent stuck.
> > > > Here are some additional questions.
> > > > 1) any suggestion on how to identify the bad broker(s)?
> > > > 2) why bouncing of the bad broker got the producers recovered automatically
> > > > (without restarting producers)
> > > >
> > > > producer: 0.8.2.1
> > > > broker: 0.8.2.1
> > > >
> > > > Thanks,
> > > > Steven
> > > >
> > >
> > >
> > > --
> > > -Regards,
> > > Mayuresh R. Gharat
> > > (862) 250-7125
> > >
> >
>
>
> --
> -Regards,
> Mayuresh R. Gharat
> (862) 250-7125
>
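
For anyone landing on this thread later: the under-replicated-partitions check mentioned above can be sketched roughly as follows. This is not LinkedIn's internal script, only a minimal illustration. The function name `suspect_brokers` and the dictionary layout (`replicas`, `isr`) are assumptions made for the example, and the sample metadata is invented; in practice the replica and ISR lists would come from whatever tool you use to describe topics. It also only helps when the bad broker actually drops out of the ISR, which may not happen if the broker is up and replicating but not serving producer requests.

```python
from collections import Counter


def suspect_brokers(partitions):
    """Rank brokers by how many partitions list them as a replica
    while they are missing from the in-sync replica set (ISR)."""
    missing = Counter()
    for p in partitions:
        # Replicas that are assigned but not in sync for this partition.
        for broker in set(p["replicas"]) - set(p["isr"]):
            missing[broker] += 1
    # Most frequently missing broker first.
    return missing.most_common()


# Hypothetical partition metadata, made up for illustration only.
partitions = [
    {"topic": "events", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 3]},
    {"topic": "events", "partition": 1, "replicas": [2, 3, 1], "isr": [3, 1]},
    {"topic": "events", "partition": 2, "replicas": [3, 1, 2], "isr": [3, 1, 2]},
]

for broker, count in suspect_brokers(partitions):
    print(f"broker {broker} is out of sync on {count} partition(s)")
```

A broker that shows up at the top of this list across many partitions is a reasonable first candidate to inspect or bounce, instead of rolling the whole cluster.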