[ https://issues.apache.org/jira/browse/STORM-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stig Rohde Døssing closed STORM-1682.
-------------------------------------
    Resolution: Not A Problem

We misidentified the cause of our topologies hanging as this issue when it was actually caused by https://issues.apache.org/jira/browse/STORM-1750. The code change is fairly big, and the only effect is to prevent the spout from pausing consumption from a partition if that partition temporarily has no leader. Since the spout should resume the partition when getBrokersInfo is called, which happens every 30 seconds, it's not really a problem.

> Kafka spout can lose partitions
> -------------------------------
>
>                 Key: STORM-1682
>                 URL: https://issues.apache.org/jira/browse/STORM-1682
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-kafka
>    Affects Versions: 0.10.0, 1.0.0, 2.0.0
>            Reporter: Stig Rohde Døssing
>            Assignee: Stig Rohde Døssing
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The KafkaSpout can lose partitions for a period, or hang, because
> getBrokersInfo
> (https://github.com/apache/storm/blob/master/external/storm-kafka/src/jvm/org/apache/storm/kafka/DynamicBrokersReader.java#L77)
> may get a NoNodeException if there is no broker info in Zookeeper
> corresponding to the leader id in Zookeeper. When this error occurs, the
> spout ignores the partition until the next time getBrokersInfo is called,
> which isn't until the next time the spout gets an exception on fetch. If the
> timing is really bad, it might ignore all the partitions and never restart.
> As far as I'm aware, Kafka doesn't update leader and broker info atomically,
> so it's possible to get unlucky and hit the NoNodeException when a broker has
> just died.
>
> I have a few suggestions for dealing with this.
>
> getBrokersInfo could simply retry the inner loop over partitions if it gets
> the NoNodeException (probably with a limit and a short sleep between
> attempts). If it fails repeatedly, the spout should be crashed.
>
> Alternatively, the DynamicBrokersReader could instead look up all brokers in
> Zookeeper, create a consumer and send a TopicMetadataRequest on it. The
> response contains the leader for each partition and host/port for the
> relevant brokers.
>
> Edit: I noticed that the spout periodically refreshes the brokers info, so
> the issue isn't as bad as I thought. I still think this change has value,
> since it avoids the spout temporarily dropping a partition.
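
For reference, the first suggestion in the description (retry with a limit and a short sleep, then crash) could look roughly like the sketch below. This is not the patch attached to the issue and not code from storm-kafka. The names RetrySketch, retry and TransientLookupException are hypothetical, and TransientLookupException is only a stand-in for ZooKeeper's NoNodeException so that the example compiles without external dependencies.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch; none of these names exist in storm-kafka.
final class RetrySketch {

    // Stand-in for the NoNodeException mentioned in the issue (assumption).
    static final class TransientLookupException extends Exception {
        TransientLookupException(String message) {
            super(message);
        }
    }

    /**
     * Runs the lookup up to maxAttempts times, sleeping sleepMillis between
     * failed attempts. Only TransientLookupException is retried; any other
     * exception propagates immediately. After the last failed attempt an
     * IllegalStateException is thrown, which a caller could let crash the spout.
     */
    static <T> T retry(Callable<T> lookup, int maxAttempts, long sleepMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        TransientLookupException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return lookup.call();
            } catch (TransientLookupException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis); // short pause before the next attempt
                }
            }
        }
        throw new IllegalStateException("lookup failed after " + maxAttempts + " attempts", last);
    }
}
```

In this sketch the loop over partitions would be passed in as the lookup, so a partition whose leader has just died gets a few more chances to resolve before the spout gives up.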