I see that the state-change logs have warning messages of this kind (Broker 7 is running the 0.8.1 code, and this is a log snippet from that broker):

...s associated leader epoch 11 is old. Current leader epoch is 11 (state.change.logger)
[2014-04-09 10:32:21,974] WARN Broker 7 ignoring LeaderAndIsr request from controller 1001 with correlation id 0 epoch 7 for partition [pets_nec_buygold,0] since its associated leader epoch 12 is old. Current leader epoch is 12 (state.change.logger)
[2014-04-09 10:32:21,974] WARN Broker 7 ignoring LeaderAndIsr request from controller 1001 with correlation id 0 epoch 7 for partition [cafe_notification,0] since its associated leader epoch 11 is old. Current leader epoch is 11 (state.change.logger)
[2014-04-09 10:32:21,975] INFO Broker 7 skipped the become-follower state change after marking its partition as follower with correlation id 0 from controller 1001 epoch 6 for partition [set_primary_photo,0] since the new leader 1008 is the same as the old leader (state.change.logger)
[2014-04-09 10:32:21,975] INFO Broker 7 skipped the become-follower state change after marking its partition as follower with correlation id 0 from controller 1001 epoch 6 for partition [external_url,0] since the new leader 1001 is the same as the old leader (state.change.logger)
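If I read those WARNs right, a broker only applies a LeaderAndIsr request when the leader epoch in the request is strictly newer than the epoch it already has for that partition, so a request carrying epoch 11 against a current epoch of 11 (or 12 against 12) is simply dropped. A tiny sketch of what I believe is being compared (the names below are made up for illustration, not the actual Kafka classes):

// Illustrative sketch only -- made-up names, not Kafka's real code; it just
// shows the comparison that I think produces the WARN lines above.
class LeaderEpochCheck {
    // Returns true if a broker should apply the LeaderAndIsr request.
    static boolean shouldApply(int requestLeaderEpoch, int currentLeaderEpoch) {
        // The request's epoch has to be strictly newer. An equal epoch
        // (11 vs 11, 12 vs 12 above) is treated as stale, and the broker
        // logs "since its associated leader epoch ... is old".
        return requestLeaderEpoch > currentLeaderEpoch;
    }
}

Which would explain why the brokers simply ignore the requests coming from controller 1001.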
And these are snippets from the broker log of a 0.8.0 node that I shut down before I tried to upgrade it (this is when most topics became unusable):

[2014-04-09 10:32:21,993] WARN Broker 8 ignoring LeaderAndIsr request from controller 1001 with correlation id 0 epoch 7 for partition [variant_assign,0] since its associated leader epoch 11 is old. Current leader epoch is 11 (state.change.logger)
[2014-04-09 10:32:21,993] WARN Broker 8 ignoring LeaderAndIsr request from controller 1001 with correlation id 0 epoch 7 for partition [meetme_new_contact_count,0] since its associated leader epoch 8 is old. Current leader epoch is 8 (state.change.logger)
[2014-04-09 10:32:21,994] INFO Broker 8 skipped the become-follower state change after marking its partition as follower with correlation id 0 from controller 1001 epoch 6 for partition [m3_auth,0] since the new leader 7 is the same as the old leader (state.change.logger)
[2014-04-09 10:32:21,994] INFO Broker 8 skipped the become-follower state change after marking its partition as follower with correlation id 0 from controller 1001 epoch 6 for partition [newsfeed_likes,0] since the new leader 1001 is the same as the old leader (state.change.logger)

In terms of upgrading from 0.8.0 to 0.8.1, is there a recommended approach one should follow? Is it possible to migrate from one version to the next on a live cluster, one server at a time? (I have put a couple of small diagnostic sketches at the bottom of this mail, in case they help.)

Thanks,
Martin

On Wed, Apr 9, 2014 at 8:38 PM, Jun Rao <jun...@gmail.com> wrote:

> Was there any error in the controller and the state-change logs?
>
> Thanks,
>
> Jun
>
>
> On Wed, Apr 9, 2014 at 11:18 AM, Marcin Michalski <mmichal...@tagged.com> wrote:
>
> > Hi, has anyone upgraded their kafka from 0.8.0 to 0.8.1 successfully one broker at a time on a live cluster?
> >
> > I am seeing strange behaviors where many of my kafka topics become unusable (by both consumers and producers). When that happens, I see lots of errors in the server logs that look like this:
> >
> > [2014-04-09 10:38:14,669] WARN [KafkaApi-1007] Fetch request with correlation id 2455 from client ReplicaFetcherThread-15-1007 on partition [risk,0] failed due to Topic risk either doesn't exist or is in the process of being deleted (kafka.server.KafkaApis)
> > [2014-04-09 10:38:14,669] WARN [KafkaApi-1007] Fetch request with correlation id 2455 from client ReplicaFetcherThread-7-1007 on partition [message,0] failed due to Topic message either doesn't exist or is in the process of being deleted (kafka.server.KafkaApis)
> >
> > When I try to consume a message from a topic that complained about the topic not existing (above warning), I get the below exception:
> >
> > ....topic message --from-beginning
> > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> > SLF4J: Defaulting to no-operation (NOP) logger implementation
> > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> > [2014-04-09 10:40:30,571] WARN [console-consumer-90716_dkafkadatahub07.tag-dev.com-1397065229615-7211ba72-leader-finder-thread], Failed to add leader for partitions [message,0]; will retry (kafka.consumer.ConsumerFetcherManager$LeaderFinderThread)
> > kafka.common.UnknownTopicOrPartitionException
> > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > at java.lang.Class.newInstance0(Class.java:355)
> > at java.lang.Class.newInstance(Class.java:308)
> > at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:79)
> > at kafka.consumer.SimpleConsumer.earliestOrLatestOffset(SimpleConsumer.scala:167)
> > at kafka.consumer.ConsumerFetcherThread.handleOffsetOutOfRange(ConsumerFetcherThread.scala:60)
> > at kafka.server.AbstractFetcherThread$$anonfun$addPartitions$2.apply(AbstractFetcherThread.scala:179)
> > at kafka.server.AbstractFetcherThread$$anonfun$addPartitions$2.apply(AbstractFetcherThread.scala:174)
> > at scala.collection.immutable.Map$Map1.foreach(Map.scala:119)
> > at kafka.server.AbstractFetcherThread.addPartitions(AbstractFetcherThread.scala:174)
> > at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:86)
> > at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:76)
> > at scala.collection.immutable.Map$Map1.foreach(Map.scala:119)
> > at kafka.server.AbstractFetcherManager.addFetcherForPartitions(AbstractFetcherManager.scala:76)
> > at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:95)
> > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
> > ----------
> >
> > *More details about my issues:*
> > My current configuration in the environment where I am testing the upgrade is 4 physical servers running 2 brokers each, with the controlled shutdown feature enabled. When I shut down the 2 brokers on one of the existing Kafka 0.8.0 machines, upgrade that machine to 0.8.1, and restart it, all is fine for a bit. Once the new brokers come up, I run kafka-preferred-replica-election.sh to make sure the restarted brokers become leaders of the existing topics. The replication factor on the topics is set to 4. I tested both producing and consuming messages against brokers that were leaders on Kafka 0.8.0 and on 0.8.1, and no issues were encountered.
> >
> > Later, I tried to perform a controlled shutdown of the 2 additional brokers on the Kafka server that still has 0.8.0 installed, and after those brokers shut down and new leaders were assigned, all of my server logs started filling up with the above exceptions and most of my topics are not usable. I pulled and built the 0.8.1 Kafka code from git last Thursday, so I should be pretty much up to date. So I am not sure whether I am doing something wrong or whether migrating from 0.8.0 to 0.8.1 on a live cluster one server at a time is not supported. Is there a recommended migration approach one should take when migrating a live 0.8.0 cluster to 0.8.1?
> >
> > As for who is the leader of one of the topics that became unusable: it is the broker that was successfully upgraded to 0.8.1:
> >
> > Topic:message   PartitionCount:1   ReplicationFactor:4   Configs:
> >     Topic: message   Partition: 0   * Leader: 1007 *   Replicas: 1007,8,9,1001   Isr: 1001,1007,8
> >
> > Brokers 9 and 1009 were shut down on one physical server that had Kafka 0.8.0 installed when these problems started occurring (I was planning to upgrade them to 0.8.1). The only way I can recover from this state is to shut down all brokers, delete all of the Kafka topic logs plus the ZooKeeper kafka directory, and start with a new cluster.
> >
> > Your help in this matter is greatly appreciated.
> >
> > Thanks,
> > Martin
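Here are the diagnostic sketches I mentioned above, in case they are useful for narrowing this down.

The first one sends a topic metadata request straight at a broker after it has been bounced, which is roughly what the console consumer's leader-finder thread does before it fails, and prints the leader/ISR it gets back. It uses the 0.8 javaapi SimpleConsumer; the host, port and topic below are placeholders for my setup:

import java.util.Collections;

import kafka.javaapi.PartitionMetadata;
import kafka.javaapi.TopicMetadata;
import kafka.javaapi.TopicMetadataRequest;
import kafka.javaapi.TopicMetadataResponse;
import kafka.javaapi.consumer.SimpleConsumer;

public class TopicProbe {
    public static void main(String[] args) {
        // Placeholders: point these at the broker you just bounced and a topic
        // that is misbehaving (port 9092 is an assumption for my setup).
        String host = "dkafkadatahub07.tag-dev.com";
        int port = 9092;
        String topic = "message";

        SimpleConsumer consumer = new SimpleConsumer(host, port, 100000, 64 * 1024, "upgrade-probe");
        try {
            TopicMetadataRequest request = new TopicMetadataRequest(Collections.singletonList(topic));
            TopicMetadataResponse response = consumer.send(request);
            for (TopicMetadata tm : response.topicsMetadata()) {
                System.out.println("topic=" + tm.topic() + " errorCode=" + tm.errorCode());
                for (PartitionMetadata pm : tm.partitionsMetadata()) {
                    System.out.println("  partition=" + pm.partitionId()
                            + " leader=" + (pm.leader() == null ? "none" : String.valueOf(pm.leader().id()))
                            + " isr=" + pm.isr()
                            + " errorCode=" + pm.errorCode());
                }
            }
        } finally {
            consumer.close();
        }
    }
}

A healthy broker should come back with errorCode 0 and a leader for partition 0; a broker that logs "Topic ... either doesn't exist or is in the process of being deleted" presumably will not.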
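The second one just dumps the epochs that the state-change WARNs are comparing: the controller epoch ("epoch 7" in the requests from controller 1001) lives in ZooKeeper under /controller_epoch, and the per-partition leader epoch ("Current leader epoch is 11/12") lives in the partition state znode. A small read-only sketch with the plain ZooKeeper client (the connect string and topic are placeholders for my setup):

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class PartitionStateDump {
    public static void main(String[] args) throws Exception {
        // Placeholders: ZooKeeper connect string (plus chroot, if any) and topic.
        String zkConnect = "localhost:2181";
        String topic = "message";

        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper(zkConnect, 30000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();

        // Controller epoch that the LeaderAndIsr requests are stamped with.
        System.out.println("/controller_epoch: "
                + new String(zk.getData("/controller_epoch", false, null)));

        // Per-partition state, including the leader_epoch the broker compares against.
        String statePath = "/brokers/topics/" + topic + "/partitions/0/state";
        System.out.println(statePath + ": "
                + new String(zk.getData(statePath, false, null)));

        zk.close();
    }
}

Comparing those values with the epochs in the WARN lines above should at least show whether the controller is re-sending state the brokers already have.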