Hi,
For some reason, after a few hours of processing, my topology starts
hanging. In the UI's 'Topology Stats' the emitted and transferred counts
are equal to 0, and I can't see anything coming out of the topology
(it normally inserts into a database).
I can't see anything unusual in the Storm worker logs, nor in the Kafka
and ZooKeeper logs.
The ZkCoordinator keeps refreshing, but nothing happens:
2014-10-31 17:00:13 s.k.ZkCoordinator [INFO] Task [2/2] Deleted partition managers: []
2014-10-31 17:00:13 s.k.ZkCoordinator [INFO] Task [2/2] New partition managers: []
2014-10-31 17:00:13 s.k.ZkCoordinator [INFO] Task [2/2] Finished refreshing
2014-10-31 17:00:13 s.k.DynamicBrokersReader [INFO] Read partition info from zookeeper: GlobalPartitionInformation{...
I don't really understand why this is hanging, or how I could fix it.
I'm using Storm 0.9.2-incubating with Kafka 0.8.1.1 and storm-kafka
0.9.2-incubating.
My topology pulls data from 4 different topics in Kafka, and has 9
different bolts. Each bolt implements IBasicBolt. I'm not doing any acking
manually (Storm should take care of this for me, right?).
It takes a few seconds for a tuple to go through the entire topology.
I'm setting a MaxSpoutPending to limit the number of tuples in the topology.
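
As a rough illustration of what this setting does, here is a toy model in plain Java. It is not Storm code and uses no Storm classes; the class MaxPendingModel and its methods are made up for this sketch. The idea it models: a spout task stops emitting once the number of tuples that have been neither acked nor failed reaches the cap, and emits again only when a slot is released.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model only: hypothetical class, not part of Storm or storm-kafka.
class MaxPendingModel {
    private final int maxPending;                       // stands in for the max spout pending value
    private final Set<String> pending = new LinkedHashSet<>();

    MaxPendingModel(int maxPending) {
        this.maxPending = maxPending;
    }

    // Returns true if the tuple was emitted; false if the cap blocked it
    // or the same message id is already pending.
    boolean tryEmit(String msgId) {
        if (pending.size() >= maxPending) {
            return false;
        }
        return pending.add(msgId);
    }

    // An ack or a fail both release the slot held by msgId.
    void complete(String msgId) {
        pending.remove(msgId);
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        MaxPendingModel spout = new MaxPendingModel(2);
        System.out.println(spout.tryEmit("t1")); // true
        System.out.println(spout.tryEmit("t2")); // true
        System.out.println(spout.tryEmit("t3")); // false: two tuples still pending
        spout.complete("t1");                    // t1 acked or failed
        System.out.println(spout.tryEmit("t3")); // true: a slot was released
    }
}
```

In this model, if no pending tuple is ever completed, tryEmit keeps returning false, which would look like emitted counts staying flat. Whether that is what happens in my topology is exactly what I can't tell.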
My tuples shouldn't exceed the max size limit, which is left at the
default on my Kafka brokers and in my SpoutConfig. I think the default is
rather high and should easily handle a few lines of text.
The tuples don't necessarily go to each bolt.
I'm defining my spouts like this:
ZkHosts zkHosts = new ZkHosts("zk1.example.com:2181",
        "zk2.example.com:2181"...);
zkHosts.refreshFreqSecs = 120;
SpoutConfig kafkaConfig = new SpoutConfig(brokerHosts(),
        "TOPIC_NAME",
        "/consumers",
        "CONSUMER_ID");
kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaSpout = new KafkaSpout(kafkaConfig);
I'm running this topology on two workers, located on two different
supervisors. In total I'm using something like 160 executors.
I would greatly appreciate any help or hints on how to fix/investigate this
problem!
Thanks,
Maxime