Regarding the "Allocation Failure" messages: these are not errors, they are the standard behavior of a generational GC. I'll let you google the details, there are tons of resources, for example https://plumbr.eu/blog/garbage-collection/understanding-garbage-collection-logs
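To give a rough reading of the line you pasted (standard HotSpot minor-GC output; the exact layout varies with JVM version and GC flags, so treat this as an approximate breakdown):

  2016-02-21T12:21:36.881+0000: 27445381.013:     wall-clock timestamp and seconds since JVM start
  GC (Allocation Failure)                          a minor collection, triggered because the young generation
                                                   could not satisfy an allocation -- the normal trigger, not an error
  ParNew: 136472K->159K(153344K), 0.0047077 secs   young-generation usage before -> after (capacity in parentheses), and its duration
  139578K->3265K(507264K), 0.0048552 secs          whole-heap usage before -> after (total heap capacity), and total pause time
  [Times: user=0.01 sys=0.00, real=0.01 secs]      CPU time (user/system) and wall-clock time of the pause

Pauses of a few milliseconds like this are routine and almost certainly unrelated to the crash.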
I believe you should stop broker 1 and wipe out the data for the topic. Once
restarted, replication will restore the data. (A rough sketch of the steps is
at the bottom of this mail.)

On Wed, Feb 24, 2016 at 8:22 AM Anthony Sparks <anthony.spark...@gmail.com> wrote:

> Hello,
>
> Our Kafka cluster (3 servers, each running Zookeeper and Kafka) crashed,
> and out of the 6 processes only one Zookeeper instance remained alive.
> The logs do not indicate much; the only errors shown were:
>
> 2016-02-21T12:21:36.881+0000: 27445381.013: [GC (Allocation Failure)
> 27445381.013: [ParNew: 136472K->159K(153344K), 0.0047077 secs]
> 139578K->3265K(507264K), 0.0048552 secs] [Times: user=0.01 sys=0.00,
> real=0.01 secs]
>
> These errors were in both the Zookeeper and the Kafka logs, and it appears
> they have been happening every day (with no impact on Kafka, except for
> maybe now?).
>
> The crash is concerning, but not as concerning as what we are encountering
> right now: I am unable to get the cluster back up. Two of the three nodes
> halt with this fatal error:
>
> [2016-02-23 21:18:47,251] FATAL [ReplicaFetcherThread-0-0], Halting
> because log truncation is not allowed for topic audit_data, Current leader
> 0's latest offset 52844816 is less than replica 1's latest offset 52844835
> (kafka.server.ReplicaFetcherThread)
>
> The other node that manages to stay alive is unable to fulfill writes
> because we have min.ack set to 2 on the producers (requiring at least two
> nodes to be available). We could change this, but that doesn't fix our
> overall problem.
>
> Browsing the Kafka code, ReplicaFetcherThread.scala contains this little
> nugget:
>
> // Prior to truncating the follower's log, ensure that doing so is not
> // disallowed by the configuration for unclean leader election.
> // This situation could only happen if the unclean election configuration
> // for a topic changes while a replica is down. Otherwise, we should never
> // encounter this situation since a non-ISR leader cannot be elected if
> // disallowed by the broker configuration.
> if (!LogConfig.fromProps(brokerConfig.toProps,
>     AdminUtils.fetchTopicConfig(replicaMgr.zkClient,
>     topicAndPartition.topic)).uncleanLeaderElectionEnable) {
>   // Log a fatal error and shutdown the broker to ensure that data loss
>   // does not unexpectedly occur.
>   fatal("Halting because log truncation is not allowed for topic %s,"
>     .format(topicAndPartition.topic) +
>     " Current leader %d's latest offset %d is less than replica %d's latest offset %d"
>     .format(sourceBroker.id, leaderEndOffset, brokerConfig.brokerId,
>       replica.logEndOffset.messageOffset))
>   Runtime.getRuntime.halt(1)
> }
>
> Each of our Kafka instances has unclean.leader.election.enable=false, and
> this hasn't changed since we deployed the cluster (verified by file
> modification timestamps). To me this indicates the assertion in the comment
> above is incorrect: we have had a non-ISR leader elected even though the
> configuration disallows it.
>
> Any ideas on how to work around this?
>
> Thank you,
>
> Tony Sparks
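For reference, a minimal sketch of the recovery I mean, assuming broker 1 is the
one halting, audit_data is the only affected topic, and /var/kafka-logs is your
log.dirs -- all of these are assumptions, so check server.properties and adjust
the paths and service commands for your installation:

  # on broker 1 only
  bin/kafka-server-stop.sh                               # stop the broker
  rm -rf /var/kafka-logs/audit_data-*                    # wipe the local partition directories for the topic
                                                         # (they are named <topic>-<partition> under log.dirs)
  bin/kafka-server-start.sh config/server.properties     # restart; the follower re-fetches the partitions
                                                         # from the current leader

Be aware that the messages on replica 1 between offsets 52844816 and 52844835
are not on the current leader, so they are lost once broker 1 resyncs -- that is
exactly the data loss the halt is flagging.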