The one thing I still find very odd is that some dead brokers remain in
the ISR set for hours and are basically never removed. Note this is not
always the case: most dead brokers are removed properly, and it is only
in a few cases that they linger. I am not sure why this would happen. Is
there a known issue in 0.8.0 that was fixed in a later release? What can
I do to diagnose/fix the situation?

Thanks,

On Wed, Oct 15, 2014 at 9:58 AM, Jean-Pascal Billaud <j...@tellapart.com>
wrote:

> So I am using 0.8.0. I think I actually found the issue. It turns out
> that some partitions only had a single replica, and the leaders of those
> partitions would basically "refuse" new writes. As soon as I reassigned
> replicas to those partitions, things kicked off again. Not sure if
> that's expected... but it seemed to make the problem go away.
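>
> In case it helps someone else hitting this, here is more or less how I
> spotted the single-replica partitions: dump the replica assignment znode
> for the topic and flag entries with only one broker id. A rough sketch
> using the plain ZooKeeper client (connect string and topic name are
> placeholders, not our real setup):
>
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> import org.apache.zookeeper.ZooKeeper;
>
> public class FindSingleReplicaPartitions {
>     public static void main(String[] args) throws Exception {
>         // Placeholder quorum and topic name.
>         ZooKeeper zk = new ZooKeeper("zk1:2181", 10000, e -> { });
>         // Replica assignment for every partition of the topic, e.g.
>         // {"version":1,"partitions":{"0":[5,3,4],"1":[7],...}}
>         String json = new String(
>             zk.getData("/brokers/topics/my_topic", false, null), "UTF-8");
>         zk.close();
>         // Crude scan: a replica list without a comma has a single broker.
>         Matcher m = Pattern.compile("\"(\\d+)\":\\[([^\\]]*)\\]").matcher(json);
>         while (m.find()) {
>             if (!m.group(2).contains(",")) {
>                 System.out.println("partition " + m.group(1)
>                     + " has a single replica: [" + m.group(2) + "]");
>             }
>         }
>     }
> }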
>
> Thanks,
>
>
> On Wed, Oct 15, 2014 at 6:46 AM, Neha Narkhede <neha.narkh...@gmail.com>
> wrote:
>
>> Which version of Kafka are you using? The current stable one is 0.8.1.1
>>
>> On Tue, Oct 14, 2014 at 5:51 PM, Jean-Pascal Billaud <j...@tellapart.com>
>> wrote:
>>
>> > Hey Neha,
>> >
>> > so I removed another broker about 30 minutes ago, and since then the
>> > producer has basically been dying with:
>> >
>> > Event queue is full of unsent messages, could not send event:
>> > KeyedMessage(my_topic,[B@1b71b7a6,[B@35fdd1e7)
>> > kafka.common.QueueFullException: Event queue is full of unsent messages,
>> > could not send event: KeyedMessage(my_topic,[B@1b71b7a6,[B@35fdd1e7)
>> > at kafka.producer.Producer$$anonfun$asyncSend$1.apply(Unknown Source)
>> > ~[kafka_2.10-0.8.0.jar:0.8.0]
>> > at kafka.producer.Producer$$anonfun$asyncSend$1.apply(Unknown Source)
>> > ~[kafka_2.10-0.8.0.jar:0.8.0]
>> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> > ~[scala-library-2.10.3.jar:na]
>> > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> > ~[scala-library-2.10.3.jar:na]
>> > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>> > ~[scala-library-2.10.3.jar:na]
>> > at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>> > ~[scala-library-2.10.3.jar:na]
>> > at kafka.producer.Producer.asyncSend(Unknown Source)
>> > ~[kafka_2.10-0.8.0.jar:0.8.0]
>> > at kafka.producer.Producer.send(Unknown Source)
>> > ~[kafka_2.10-0.8.0.jar:0.8.0]
>> > at kafka.javaapi.producer.Producer.send(Unknown Source)
>> > ~[kafka_2.10-0.8.0.jar:0.8.0]
>> >
>> > It seems like it cannot recover for some reason. The new leaders appear
>> > to have been elected, so the producer should have picked up the new
>> > metadata for the partitions. Is this a known issue in 0.8.0? What should
>> > I be looking for to debug/fix this?
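>> >
>> > For reference, the producer is configured roughly along these lines (a
>> > minimal sketch of the 0.8 async producer setup, with placeholder broker
>> > addresses and values rather than our actual settings). As far as I can
>> > tell, queue.buffering.max.messages and queue.enqueue.timeout.ms are what
>> > govern the QueueFullException above:
>> >
>> > import java.util.Properties;
>> > import kafka.javaapi.producer.Producer;
>> > import kafka.producer.KeyedMessage;
>> > import kafka.producer.ProducerConfig;
>> >
>> > public class AsyncProducerSketch {
>> >     public static void main(String[] args) {
>> >         Properties props = new Properties();
>> >         // Placeholder broker list.
>> >         props.put("metadata.broker.list", "broker1:9092,broker2:9092");
>> >         props.put("producer.type", "async");
>> >         // Once this many messages are buffered and the enqueue timeout
>> >         // is not -1 (block forever), send() throws QueueFullException.
>> >         props.put("queue.buffering.max.messages", "10000");
>> >         props.put("queue.enqueue.timeout.ms", "0");
>> >         Producer<byte[], byte[]> producer =
>> >             new Producer<byte[], byte[]>(new ProducerConfig(props));
>> >         producer.send(new KeyedMessage<byte[], byte[]>(
>> >             "my_topic", "key".getBytes(), "value".getBytes()));
>> >         producer.close();
>> >     }
>> > }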
>> >
>> > Thanks,
>> >
>> > On Tue, Oct 14, 2014 at 2:22 PM, Neha Narkhede <neha.narkh...@gmail.com>
>> > wrote:
>> >
>> > > Regarding (1), I am assuming that it is expected that brokers going down
>> > > will be brought back up soon, at which point they will pick up from the
>> > > current leader and get back into the ISR. Am I right?
>> > >
>> > > The broker will be added back to the ISR once it is restarted, but it
>> > > never leaves the replica list until the admin explicitly moves it using
>> > > the reassign partitions tool.
>> > >
>> > > Regarding (2), I finally kicked off a reassign_partitions admin task
>> > > adding broker 7 to the replica list for partition 0, which finally
>> > > fixed the under-replication issue. Is it therefore expected that the
>> > > user will fix up the under-replication situation?
>> > >
>> > > Yes. Currently, partition reassignment is purely an admin-initiated task.
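>> > >
>> > > For completeness, the reassignment is driven by a JSON file handed to
>> > > the reassign partitions tool. On the current releases the file looks
>> > > roughly like the sketch below (topic, partition and broker ids here are
>> > > just an example, and the exact schema on 0.8.0 may differ slightly, so
>> > > double-check against the tool's documentation):
>> > >
>> > > {"version": 1,
>> > >  "partitions": [
>> > >    {"topic": "X", "partition": 0, "replicas": [3, 4, 7]}
>> > >  ]}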
>> > >
>> > > Another thing I'd like to clarify is that for another topic Y, broker 5
>> > > was never removed from the ISR array. Note that Y is an unused topic, so
>> > > I am guessing that technically broker 5 is not out of sync... though it
>> > > is still dead. Is this the expected behavior?
>> > >
>> > > Not really. After replica.lag.time.max.ms (which defaults to 10 seconds),
>> > > the leader should remove the dead broker from the ISR.
>> > >
>> > > Thanks,
>> > > Neha
>> > >
>> > > On Tue, Oct 14, 2014 at 9:27 AM, Jean-Pascal Billaud <j...@tellapart.com>
>> > > wrote:
>> > >
>> > > > hey folks,
>> > > >
>> > > > I have been testing a Kafka cluster of 10 nodes on AWS using version
>> > > > 2.8.0-0.8.0, and I am seeing some failover behavior that I want to
>> > > > make sure I understand.
>> > > >
>> > > > Initially, I have a topic X with 30 partitions and a replication factor
>> > > > of 3. Looking at partition 0:
>> > > > partition: 0 - leader: 5 preferred leader: 5 brokers: [5, 3, 4] in-sync: [5, 3, 4]
>> > > >
>> > > > When I kill broker 5, the controller immediately grabs the next replica
>> > > > in the ISR and assigns it as the leader:
>> > > > partition: 0 - leader: 3 preferred leader: 5 brokers: [5, 3, 4] in-sync: [3, 4]
>> > > >
>> > > > There are a couple of things at this point I would like to clarify:
>> > > >
>> > > > (1) Why is broker 5 still in the brokers array for partition 0? Note that
>> > > > this broker array comes from a get of the zookeeper path
>> > > > /brokers/topics/[topic], as documented.
>> > > > (2) Partition 0 is now under-replicated and the controller does not seem
>> > > > to do anything about it. Is this expected?
>> > > >
>> > > > Regarding (1), I am assuming that it is expected that brokers going down
>> > > > will be brought back up soon, at which point they will pick up from the
>> > > > current leader and get back into the ISR. Am I right?
>> > > >
>> > > > Regarding (2), I finally kicked off a reassign_partitions admin task
>> > > > adding broker 7 to the replica list for partition 0, which finally
>> > > > fixed the under-replication issue:
>> > > >
>> > > > partition: 0 - leader: 3  expected_leader: 3  brokers: [3, 4, 7] in-sync: [3, 4, 7]
>> > > >
>> > > > Is it therefore expected that the user will fix up the under-replication
>> > > > situation? Or is it again expected that broker 5 will come back soon, so
>> > > > this whole thing is a non-issue once that happens, given that
>> > > > decommissioning brokers is not supported as of the Kafka version I am
>> > > > using?
>> > > >
>> > > > Another thing I'd like to clarify is that for another topic Y, broker 5
>> > > > was never removed from the ISR array. Note that Y is an unused topic, so
>> > > > I am guessing that technically broker 5 is not out of sync... though it
>> > > > is still dead. Is this the expected behavior?
>> > > >
>> > > > I'd really appreciate it if somebody could confirm my understanding.
>> > > >
>> > > > Thanks,
>> > > >
>> > >
>> >
>>
>
>
