No broker restarts. Created a Kafka issue: https://issues.apache.org/jira/browse/KAFKA-2011
>> Logs for rebalance:
>> [2015-03-07 16:52:48,969] INFO [Controller 2]: Resuming preferred replica election for partitions: (kafka.controller.KafkaController)
>> [2015-03-07 16:52:48,969] INFO [Controller 2]: Partitions that completed preferred replica election: (kafka.controller.KafkaController)
>> ...
>> [2015-03-07 12:07:06,783] INFO [Controller 4]: Resuming preferred replica election for partitions: (kafka.controller.KafkaController)
>> ...
>> [2015-03-07 09:10:41,850] INFO [Controller 3]: Resuming preferred replica election for partitions: (kafka.controller.KafkaController)
>> ...
>> [2015-03-07 08:26:56,396] INFO [Controller 1]: Starting preferred replica leader election for partitions (kafka.controller.KafkaController)
>> ...
>> [2015-03-06 16:52:59,506] INFO [Controller 2]: Partitions undergoing preferred replica election: (kafka.controller.KafkaController)
>>
>> Also, I still see lots of the errors below (~69k) in the logs since the restart. Is there any reason other than the rebalance for these errors?
>>
>> [2015-03-07 14:23:28,963] ERROR [ReplicaFetcherThread-2-5], Error for partition [Topic-11,7] to broker 5:class kafka.common.NotLeaderForPartitionException (kafka.server.ReplicaFetcherThread)
>> [2015-03-07 14:23:28,963] ERROR [ReplicaFetcherThread-1-5], Error for partition [Topic-2,25] to broker 5:class kafka.common.NotLeaderForPartitionException (kafka.server.ReplicaFetcherThread)
>> [2015-03-07 14:23:28,963] ERROR [ReplicaFetcherThread-2-5], Error for partition [Topic-2,21] to broker 5:class kafka.common.NotLeaderForPartitionException (kafka.server.ReplicaFetcherThread)
>> [2015-03-07 14:23:28,963] ERROR [ReplicaFetcherThread-1-5], Error for partition [Topic-22,9] to broker 5:class kafka.common.NotLeaderForPartitionException (kafka.server.ReplicaFetcherThread)

> Could you paste the related logs in controller.log?

What specifically should I search for in the logs?

Thanks,
Zakee
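For reference, one way to pull the relevant entries out of controller.log is a plain grep; a minimal sketch, where the log path is an assumption and not taken from this thread:

    # look for controller-driven preferred replica (leader) elections
    grep -i "preferred replica" /var/log/kafka/controller.log

    # narrow to the election start/resume events of the kind quoted above
    grep -iE "Starting preferred replica leader election|Resuming preferred replica election" /var/log/kafka/controller.log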
> On Mar 9, 2015, at 11:35 AM, Jiangjie Qin <j...@linkedin.com.INVALID> wrote:
>
> Is there anything wrong with the brokers around that time? E.g. a broker restart?
> The logs you pasted are actually from the replica fetchers. Could you paste the
> related logs in controller.log?
>
> Thanks.
>
> Jiangjie (Becket) Qin
>
> On 3/9/15, 10:32 AM, "Zakee" <kzak...@netzero.net> wrote:
>
>> Correction: actually the rebalance continued until about 24 hours after
>> the start, and that is where the errors below were found. Ideally the rebalance
>> should not have happened at all.
>>
>> Thanks
>> Zakee
>>
>>> On Mar 9, 2015, at 10:28 AM, Zakee <kzak...@netzero.net> wrote:
>>>
>>>> Hmm, that sounds like a bug. Can you paste the log of the leader rebalance
>>>> here?
>>>
>>> Thanks for your suggestions.
>>> It looks like the rebalance actually happened only once, soon after I
>>> started with a clean cluster and data was pushed; it has not happened again
>>> so far, and the partition leader counts on the brokers have not changed
>>> since then. One of the brokers constantly shows 0 for its partition
>>> leader count. Is that normal?
>>>
>>> Also, I still see lots of the errors below (~69k) in the logs
>>> since the restart. Is there any reason other than the rebalance for these
>>> errors?
>>>
>>> [2015-03-07 14:23:28,963] ERROR [ReplicaFetcherThread-2-5], Error for partition [Topic-11,7] to broker 5:class kafka.common.NotLeaderForPartitionException (kafka.server.ReplicaFetcherThread)
>>> [2015-03-07 14:23:28,963] ERROR [ReplicaFetcherThread-1-5], Error for partition [Topic-2,25] to broker 5:class kafka.common.NotLeaderForPartitionException (kafka.server.ReplicaFetcherThread)
>>> [2015-03-07 14:23:28,963] ERROR [ReplicaFetcherThread-2-5], Error for partition [Topic-2,21] to broker 5:class kafka.common.NotLeaderForPartitionException (kafka.server.ReplicaFetcherThread)
>>> [2015-03-07 14:23:28,963] ERROR [ReplicaFetcherThread-1-5], Error for partition [Topic-22,9] to broker 5:class kafka.common.NotLeaderForPartitionException (kafka.server.ReplicaFetcherThread)
>>>
>>>> Some other things to check are:
>>>> 1. The actual property name is auto.leader.rebalance.enable, not
>>>> auto.leader.rebalance. You've probably known this, just to double
>>>> confirm.
>>>
>>> Yes.
>>>
>>>> 2. In the zookeeper path, can you verify /admin/preferred_replica_election
>>>> does not exist?
>>>
>>> ls /admin
>>> [delete_topics]
>>> ls /admin/preferred_replica_election
>>> Node does not exist: /admin/preferred_replica_election
>>>
>>> Thanks
>>> Zakee
>>>
>>>> On Mar 7, 2015, at 10:49 PM, Jiangjie Qin <j...@linkedin.com.INVALID> wrote:
>>>>
>>>> Hmm, that sounds like a bug. Can you paste the log of the leader rebalance
>>>> here?
>>>> Some other things to check are:
>>>> 1. The actual property name is auto.leader.rebalance.enable, not
>>>> auto.leader.rebalance. You've probably known this, just to double
>>>> confirm.
>>>> 2. In the zookeeper path, can you verify /admin/preferred_replica_election
>>>> does not exist?
>>>>
>>>> Jiangjie (Becket) Qin
>>>>
>>>> On 3/7/15, 10:24 PM, "Zakee" <kzak...@netzero.net> wrote:
>>>>
>>>>> I started with a clean cluster and started to push data. It still does
>>>>> the rebalance at random intervals even though auto.leader.rebalance
>>>>> is set to false.
>>>>>
>>>>> Thanks
>>>>> Zakee
>>>>>
>>>>>> On Mar 6, 2015, at 3:51 PM, Jiangjie Qin <j...@linkedin.com.INVALID> wrote:
>>>>>>
>>>>>> Yes, the rebalance should not happen in that case. That is a little
>>>>>> bit strange. Could you try to launch a clean Kafka cluster with
>>>>>> auto leader election disabled and try pushing data?
>>>>>> When leader migration occurs, a NotLeaderForPartition exception is
>>>>>> expected.
>>>>>>
>>>>>> Jiangjie (Becket) Qin
>>>>>>
>>>>>> On 3/6/15, 3:14 PM, "Zakee" <kzak...@netzero.net> wrote:
>>>>>>
>>>>>>> Yes, Jiangjie, I do see lots of these "Starting preferred replica
>>>>>>> leader election for partitions" entries in the logs. I also see a lot of
>>>>>>> Produce request failure warnings with the NotLeader exception.
>>>>>>>
>>>>>>> I tried switching auto.leader.rebalance off (set it to false). I am still
>>>>>>> noticing the rebalance happening. My understanding was that the rebalance
>>>>>>> would not happen when this is set to false.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Zakee
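The setting being toggled above lives in each broker's server.properties; a minimal sketch of the relevant lines (values shown are illustrative defaults, not taken from this thread):

    # disable automatic preferred-replica (leader) rebalancing by the controller
    auto.leader.rebalance.enable=false

    # only consulted when auto rebalancing is enabled: how often the controller checks
    # for leader imbalance, and the per-broker imbalance ratio that triggers an election
    leader.imbalance.check.interval.seconds=300
    leader.imbalance.per.broker.percentage=10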
>>>>>>>
>>>>>>>> On Feb 25, 2015, at 5:17 PM, Jiangjie Qin <j...@linkedin.com.INVALID> wrote:
>>>>>>>>
>>>>>>>> I don't think num.replica.fetchers will help in this case. Increasing the
>>>>>>>> number of fetcher threads only helps in cases where you have a
>>>>>>>> large amount of data coming into a broker and more replica fetcher
>>>>>>>> threads help it keep up. We usually use only 1-2 for each broker. But in your
>>>>>>>> case, it looks like leader migration is causing the issue.
>>>>>>>> Do you see anything else in the log? Like a preferred leader
>>>>>>>> election?
>>>>>>>>
>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>
>>>>>>>> On 2/25/15, 5:02 PM, "Zakee" <kzak...@netzero.net> wrote:
>>>>>>>>
>>>>>>>>> Thanks, Jiangjie.
>>>>>>>>>
>>>>>>>>> Yes, I do see under-replicated partitions, usually shooting up every hour.
>>>>>>>>> Anything I could try to reduce that?
>>>>>>>>>
>>>>>>>>> How does "num.replica.fetchers" affect the replica sync? Currently I have
>>>>>>>>> configured 7 on each of the 5 brokers.
>>>>>>>>>
>>>>>>>>> -Zakee
>>>>>>>>>
>>>>>>>>> On Wed, Feb 25, 2015 at 4:17 PM, Jiangjie Qin <j...@linkedin.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> These messages are usually caused by leader migration. I think as
>>>>>>>>>> long as you don't see this lasting forever and don't end up with a bunch of
>>>>>>>>>> under-replicated partitions, it should be fine.
>>>>>>>>>>
>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>
>>>>>>>>>> On 2/25/15, 4:07 PM, "Zakee" <kzak...@netzero.net> wrote:
>>>>>>>>>>
>>>>>>>>>>> Need to know if I should be worried about this or ignore them.
>>>>>>>>>>>
>>>>>>>>>>> I see tons of these exceptions/warnings in the broker logs, not sure what
>>>>>>>>>>> causes them and what could be done to fix them.
>>>>>>>>>>>
>>>>>>>>>>> ERROR [ReplicaFetcherThread-3-5], Error for partition [TestTopic] to broker 5:class kafka.common.NotLeaderForPartitionException (kafka.server.ReplicaFetcherThread)
>>>>>>>>>>> [2015-02-25 11:01:41,785] ERROR [ReplicaFetcherThread-3-5], Error for partition [TestTopic] to broker 5:class kafka.common.NotLeaderForPartitionException (kafka.server.ReplicaFetcherThread)
>>>>>>>>>>> [2015-02-25 11:01:41,785] WARN [Replica Manager on Broker 2]: Fetch request with correlation id 950084 from client ReplicaFetcherThread-1-2 on partition [TestTopic,2] failed due to Leader not local for partition [TestTopic,2] on broker 2 (kafka.server.ReplicaManager)
>>>>>>>>>>>
>>>>>>>>>>> Any ideas?
>>>>>>>>>>>
>>>>>>>>>>> -Zakee
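On the under-replicated partitions discussed above: they can be listed with the topic tool that ships with Kafka; a minimal sketch, assuming ZooKeeper is reachable at localhost:2181 (host and port are assumptions):

    # list partitions whose ISR is currently smaller than the full replica set
    bin/kafka-topics.sh --describe --under-replicated-partitions --zookeeper localhost:2181

The num.replica.fetchers value mentioned above is a per-broker setting in server.properties (for example num.replica.fetchers=2); it only controls how many threads a broker uses to fetch from leaders, it does not affect whether leaders move.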
Thanks
Zakee
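The ZooKeeper check quoted earlier (ls /admin and ls /admin/preferred_replica_election) can be reproduced with the zookeeper-shell tool bundled with Kafka; a minimal sketch, with the host and port assumed:

    # open an interactive shell against the cluster's ZooKeeper ensemble
    bin/zookeeper-shell.sh localhost:2181
    # then, at the prompt:
    #   ls /admin
    #   ls /admin/preferred_replica_election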