Do you see the data loss warning after a controlled shutdown? It isn't clear from your original message whether the warning is associated with a shutdown operation.
We have a test setup similar to what you are describing - i.e., continuous
rolling bounces of a test cluster (while there is traffic flowing into it
through mirror makers). For each broker: wait until the under-replicated-partition
count on every broker is zero, then proceed to do a controlled shutdown of
that broker (a rough sketch of such a loop is appended after the quoted
thread below).

Thanks,

Joel

On Wed, Apr 09, 2014 at 09:02:45AM -0400, Alex Gray wrote:
> Thanks Joel and Guozhang!
> The data retention is 72 hours.
> Graceful shutdown is done via SIGTERM, and
> controlled.shutdown.enable=true is in the config.
> I do see 'Controlled shutdown succeeded' in the broker log when I
> shut it down.
>
> With both your responses, I feel as if the brokers are indeed set up and
> functioning correctly.
>
> I want to ask the developers if I can write a script that
> gracefully restarts each broker randomly throughout the entire day,
> 24/7 :)
>
> That should weed out any issues.
>
> Thanks guys,
>
> Alex
>
>
> On Tue Apr 8 20:38:15 2014, Joel Koshy wrote:
> >Also, when you say "graceful shutdown" do you mean you issue SIGTERM? Do
> >you have controlled.shutdown.enable=true in the broker config? If that
> >is set and the controlled shutdown succeeds (i.e., if you see
> >'Controlled shutdown succeeded' in the broker log) then you shouldn't
> >be seeing the data loss warning in your controller log during the
> >shutdown and restarts. Or are you seeing it at other times as well?
> >
> >WRT the OffsetOutOfRangeException: is your broker down for a long
> >period? Do you have a very low retention setting for your topics? Or
> >are you bringing up a consumer that has been down for a long period?
> >
> >Thanks,
> >
> >Joel
> >
> >On Tue, Apr 08, 2014 at 04:58:08PM -0700, Guozhang Wang wrote:
> >>Hi Alex,
> >>
> >>1. There is no "cool-off" time, since the rebalance should be done before
> >>the server completes shutdown.
> >>
> >>2. The logs are indicating there is possible data loss, which is "expected"
> >>if your producer's request.required.acks config is <= 1 but not == -1. If you do not
> >>want data loss, you can change that config value in your producer clients
> >>to be > 1, which will effectively trade some latency and availability for
> >>consistency.
> >>
> >>Guozhang
> >>
> >>
> >>On Tue, Apr 8, 2014 at 9:51 AM, Alex Gray <alex.g...@inin.com> wrote:
> >>
> >>>We have 3 Zookeepers and 3 Kafka brokers, version 0.8.0.
> >>>
> >>>I gracefully shut down one of the Kafka brokers.
> >>>
> >>>Question 1: Should I wait some time before starting the broker back up,
> >>>or can I restart it as soon as possible? In other words, do I have to wait
> >>>for the other brokers to "re-balance (or whatever they do)" before starting
> >>>it back up?
> >>>
> >>>Question 2: Every once in a while, I get the following exception when the
> >>>Kafka broker is starting up. Is this bad? Searching around the
> >>>newsgroups, I could not get a definitive answer. Examples:
> >>>http://grokbase.com/t/kafka/users/13cq54bx5q/understanding-offsetoutofrangeexceptions
> >>>http://grokbase.com/t/kafka/users/1413hp296y/trouble-recovering-after-a-crashed-broker
> >>>
> >>>Here is the exception:
> >>>[2014-04-08 00:02:40,555] ERROR [KafkaApi-3] Error when processing fetch
> >>>request for partition [KeyPairGenerated,0] offset 514 from consumer with
> >>>correlation id 85 (kafka.server.KafkaApis)
> >>>kafka.common.OffsetOutOfRangeException: Request for offset 514 but we
> >>>only have log segments in the range 0 to 0.
> >>>        at kafka.log.Log.read(Log.scala:429)
> >>>        at kafka.server.KafkaApis.kafka$server$KafkaApis$$readMessageSet(KafkaApis.scala:388)
> >>>        at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$readMessageSets$1.apply(KafkaApis.scala:334)
> >>>        at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$readMessageSets$1.apply(KafkaApis.scala:330)
> >>>        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
> >>>        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:206)
> >>>        at scala.collection.immutable.Map$Map1.foreach(Map.scala:105)
> >>>        at scala.collection.TraversableLike$class.map(TraversableLike.scala:206)
> >>>        at scala.collection.immutable.Map$Map1.map(Map.scala:93)
> >>>        at kafka.server.KafkaApis.kafka$server$KafkaApis$$readMessageSets(KafkaApis.scala:330)
> >>>        at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:296)
> >>>        at kafka.server.KafkaApis.handle(KafkaApis.scala:66)
> >>>        at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
> >>>        at java.lang.Thread.run(Thread.java:722)
> >>>
> >>>And in the controller.log, I see every once in a while something like:
> >>>
> >>>controller.log.2014-04-01-04:[2014-04-01 04:42:41,713] WARN [OfflinePartitionLeaderSelector]: No broker in ISR is alive for [KeyPairGenerated,0]. Elect leader 3 from live brokers 3. There's potential data loss. (kafka.controller.OfflinePartitionLeaderSelector)
> >>>
> >>>(Which I did via: grep "data loss" *)
> >>>
> >>>I'm not a programmer: I am the admin for these machines, and I just want
> >>>to make sure everything is cool.
> >>>Oh, the server.properties has:
> >>>default.replication.factor=3
> >>>
> >>>Thanks,
> >>>
> >>>Alex
> >>>
> >>>
> >>
> >>
> >>--
> >>-- Guozhang
> >
> >
>
> --
> *Alex Gray* | DevOps Engineer, PureCloud
> Phone +1.317.493.4291 | mobile +1.857.636.2810
> *Interactive Intelligence*
> Deliberately Innovative
> www.inin.com <http://www.inin.com/>
>
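For anyone wanting to automate the rolling bounce described at the top of
this message, here is a minimal sketch of that loop. It is not from the
original thread: urp_count, stop_broker, and start_broker are hypothetical
placeholders you would wire up to your own tooling (for example reading the
broker's UnderReplicatedPartitions gauge exposed over JMX by ReplicaManager,
and sending SIGTERM to the broker process).

# Minimal sketch, assuming three brokers and the placeholder helpers below.
# The only logic it encodes is Joel's procedure: never touch the next broker
# until every broker reports zero under-replicated partitions, bounce one
# broker at a time via SIGTERM (controlled shutdown), and wait for the
# cluster to settle before moving on.
import time

BROKERS = ["kafka1", "kafka2", "kafka3"]   # assumption: your broker hosts

def urp_count(broker):
    """Placeholder: return the broker's UnderReplicatedPartitions count,
    e.g. read over JMX from ReplicaManager or from your monitoring system."""
    raise NotImplementedError

def stop_broker(broker):
    """Placeholder: send SIGTERM to the Kafka process on `broker` so that
    controlled shutdown (controlled.shutdown.enable=true) migrates leaders."""
    raise NotImplementedError

def start_broker(broker):
    """Placeholder: start the Kafka process on `broker` again."""
    raise NotImplementedError

def wait_for_zero_urp(poll_seconds=10):
    # Block until *every* broker reports zero under-replicated partitions,
    # i.e. all replicas are back in the ISR.
    while any(urp_count(b) > 0 for b in BROKERS):
        time.sleep(poll_seconds)

for broker in BROKERS:
    wait_for_zero_urp()     # only take a broker down when the cluster is healthy
    stop_broker(broker)
    start_broker(broker)
    wait_for_zero_urp()     # let the restarted broker catch back up

Run repeatedly (e.g. from cron) this gives the continuous 24/7 bounce Alex
mentions; the property that matters is that at most one broker is ever down
and all replicas are back in the ISR before the next bounce starts.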