Regarding the "Allocation Failure" messages: these are not errors, they are the standard behavior of a generational GC. I'll let you google the details, there are tons of resources, for example https://plumbr.eu/blog/garbage-collection/understanding-garbage-collection-logs
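To give a rough reading of the line you pasted (standard HotSpot minor-GC output; the exact layout varies with JVM version and GC flags, so treat this as an approximate breakdown):

  2016-02-21T12:21:36.881+0000: 27445381.013:     wall-clock timestamp and seconds since JVM start
  GC (Allocation Failure)                          a minor collection, triggered because the young generation
                                                   could not satisfy an allocation -- the normal trigger, not an error
  ParNew: 136472K->159K(153344K), 0.0047077 secs   young-generation usage before -> after (capacity in parentheses), and its duration
  139578K->3265K(507264K), 0.0048552 secs          whole-heap usage before -> after (total heap capacity), and total pause time
  [Times: user=0.01 sys=0.00, real=0.01 secs]      CPU time (user/system) and wall-clock time of the pause

Pauses of a few milliseconds like this are routine and almost certainly unrelated to the crash.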
I believe you should stop broker 1 and wipe out the data for the topic. Once
restarted, replication will restore the data. (A rough sketch of the steps is
at the bottom of this mail.)

On Wed, Feb 24, 2016 at 8:22 AM Anthony Sparks <anthony.spark...@gmail.com> wrote:

> Hello,
>
> Our Kafka cluster (3 servers, each running Zookeeper and Kafka) crashed,
> and out of the 6 processes only one Zookeeper instance remained alive.
> The logs do not indicate much; the only errors shown were:
>
> 2016-02-21T12:21:36.881+0000: 27445381.013: [GC (Allocation Failure)
> 27445381.013: [ParNew: 136472K->159K(153344K), 0.0047077 secs]
> 139578K->3265K(507264K), 0.0048552 secs] [Times: user=0.01 sys=0.00,
> real=0.01 secs]
>
> These errors were in both the Zookeeper and the Kafka logs, and it appears
> they have been happening every day (with no impact on Kafka, except for
> maybe now?).
>
> The crash is concerning, but not as concerning as what we are encountering
> right now: I am unable to get the cluster back up. Two of the three nodes
> halt with this fatal error:
>
> [2016-02-23 21:18:47,251] FATAL [ReplicaFetcherThread-0-0], Halting
> because log truncation is not allowed for topic audit_data, Current leader
> 0's latest offset 52844816 is less than replica 1's latest offset 52844835
> (kafka.server.ReplicaFetcherThread)
>
> The other node that manages to stay alive is unable to fulfill writes
> because we have min.ack set to 2 on the producers (requiring at least two
> nodes to be available). We could change this, but that doesn't fix our
> overall problem.
>
> Browsing the Kafka code, ReplicaFetcherThread.scala contains this little
> nugget:
>
> // Prior to truncating the follower's log, ensure that doing so is not
> // disallowed by the configuration for unclean leader election.
> // This situation could only happen if the unclean election configuration
> // for a topic changes while a replica is down. Otherwise, we should never
> // encounter this situation since a non-ISR leader cannot be elected if
> // disallowed by the broker configuration.
> if (!LogConfig.fromProps(brokerConfig.toProps,
>     AdminUtils.fetchTopicConfig(replicaMgr.zkClient,
>     topicAndPartition.topic)).uncleanLeaderElectionEnable) {
>   // Log a fatal error and shutdown the broker to ensure that data loss
>   // does not unexpectedly occur.
>   fatal("Halting because log truncation is not allowed for topic %s,"
>     .format(topicAndPartition.topic) +
>     " Current leader %d's latest offset %d is less than replica %d's latest offset %d"
>     .format(sourceBroker.id, leaderEndOffset, brokerConfig.brokerId,
>       replica.logEndOffset.messageOffset))
>   Runtime.getRuntime.halt(1)
> }
>
> Each of our Kafka instances has unclean.leader.election.enable=false, and
> this hasn't changed since we deployed the cluster (verified by file
> modification timestamps). To me this indicates the assertion in the comment
> above is incorrect: we have had a non-ISR leader elected even though the
> configuration disallows it.
>
> Any ideas on how to work around this?
>
> Thank you,
>
> Tony Sparks
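For reference, a minimal sketch of the recovery I mean, assuming broker 1 is the
one halting, audit_data is the only affected topic, and /var/kafka-logs is your
log.dirs -- all of these are assumptions, so check server.properties and adjust
the paths and service commands for your installation:

  # on broker 1 only
  bin/kafka-server-stop.sh                               # stop the broker
  rm -rf /var/kafka-logs/audit_data-*                    # wipe the local partition directories for the topic
                                                         # (they are named <topic>-<partition> under log.dirs)
  bin/kafka-server-start.sh config/server.properties     # restart; the follower re-fetches the partitions
                                                         # from the current leader

Be aware that the messages on replica 1 between offsets 52844816 and 52844835
are not on the current leader, so they are lost once broker 1 resyncs -- that is
exactly the data loss the halt is flagging.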