Hi there,
I'm new to Storm. I set up a Storm cluster with 3 machines: one machine is the master, where I run ZooKeeper and the Nimbus server, and the other two machines run supervisors. My spout shows a very high failure rate and I cannot figure out why. Please see the detailed statistics below.
Topology stats
Window         Emitted   Transferred   Complete latency (ms)   Acked   Failed
10m 0s         32        32            124856.430              7       6
3h 0m 0s       15780     15780         108286.484              2885    6435
1d 0h 0m 0s    15780     15780         108286.484              2885    6435
All time       15780     15780         108286.484              2885    6435
Spouts (All time)
Id                Executors   Tasks   Emitted   Transferred   Complete latency (ms)   Acked   Failed   Last error
JMS_QUEUE_SPOUT   2           2       9320      9320          108286.484              2885    6435
Bolts (All time)
Id                     Executors   Tasks   Emitted   Transferred   Capacity (last 10m)   Execute latency (ms)   Executed   Process latency (ms)   Acked   Failed   Last error
AGGREGATOR_BOLT        8           8       3230      3230          0.000                 61.529                 3230       58.975                 3230    0
MESSAGEFILTER_BOLT     8           8       3230      3230          0.000                 23.207                 9320       18.054                 9320    0
OFFER_GENERATOR_BOLT   8           8       0         0             0.000                 38.267                 3230       34.693                 3230    0
For the spout, only 2885 tuples were acked and 6435 failed. The complete latency is horribly high.
Here is my Storm configuration:
Topology Configuration
Key Value
zmq.threads 1
zmq.linger.millis 5000
zmq.hwm 0
worker.heartbeat.frequency.secs 1
worker.childopts -Xmx768m -Djava.net.preferIPv4Stack=false
-DNEARLINE_DATA_ENV=dev -DNEARLINE_APP_ENV=dev -DNEARLINE_QUEUES_ENV=dev
-Dauthfilter.appcred.default.encrypt.file=/home/xwei/FP_AppCred_Encrypt.txt
-Dauthfilter.appcred.default.passphrase.file=/home/xwei/FP_AppCred_Passphrase.txt
ui.port 8080
ui.childopts -Xmx768m
transactional.zookeeper.servers
transactional.zookeeper.root /transactional
transactional.zookeeper.port
topology.workers 4
topology.worker.shared.thread.pool.size 4
topology.worker.childopts
topology.tuple.serializer
backtype.storm.serialization.types.ListDelegateSerializer
topology.trident.batch.emit.interval.millis 500
topology.transfer.buffer.size 32
topology.tick.tuple.freq.secs
topology.tasks
topology.stats.sample.rate 1
topology.state.synchronization.timeout.secs 60
topology.spout.wait.strategy backtype.storm.spout.SleepSpoutWaitStrategy
topology.sleep.spout.wait.strategy.time.ms 1
topology.skip.missing.kryo.registrations false
topology.receiver.buffer.size 8
topology.optimize true
topology.name nearline
topology.message.timeout.secs 30
topology.max.task.parallelism
topology.max.spout.pending
topology.max.error.report.per.interval 5
topology.kryo.register
topology.kryo.factory backtype.storm.serialization.DefaultKryoFactory
topology.kryo.decorators []
topology.fall.back.on.java.serialization true
topology.executor.send.buffer.size 16384
topology.executor.receive.buffer.size 16384
topology.error.throttle.interval.secs 10
topology.enable.message.timeouts true
topology.disruptor.wait.strategy com.lmax.disruptor.BlockingWaitStrategy
topology.debug false
topology.builtin.metrics.bucket.size.secs 60
topology.acker.executors 4
task.refresh.poll.secs 10
task.heartbeat.frequency.secs 3
supervisor.worker.timeout.secs 30
supervisor.worker.start.timeout.secs 120
supervisor.slots.ports [6700 6701 6702 6703]
supervisor.monitor.frequency.secs 3
supervisor.heartbeat.frequency.secs 5
supervisor.enable true
supervisor.childopts -Xmx256m -Djava.net.preferIPv4Stack=true
storm.zookeeper.session.timeout 20000
storm.zookeeper.servers ["zookeeper"]
storm.zookeeper.root /storm
storm.zookeeper.retry.times 5
storm.zookeeper.retry.intervalceiling.millis 30000
storm.zookeeper.retry.interval 1000
storm.zookeeper.port 2181
storm.zookeeper.connection.timeout 15000
storm.thrift.transport backtype.storm.security.auth.SimpleTransportPlugin
storm.messaging.transport backtype.storm.messaging.zmq
storm.messaging.netty.server_worker_threads 1
storm.messaging.netty.min_wait_ms 100
storm.messaging.netty.max_wait_ms 1000
storm.messaging.netty.max_retries 30
storm.messaging.netty.client_worker_threads 1
storm.messaging.netty.buffer_size 5242880
storm.local.mode.zmq false
storm.local.dir /app_local/storm
storm.id nearline-1-1406570637
storm.cluster.mode distributed
nimbus.topology.validator backtype.storm.nimbus.DefaultTopologyValidator
nimbus.thrift.port 6627
nimbus.task.timeout.secs 30
nimbus.task.launch.secs 120
nimbus.supervisor.timeout.secs 60
nimbus.reassign true
nimbus.monitor.freq.secs 10
nimbus.inbox.jar.expiration.secs 3600
nimbus.host zookeeper
nimbus.file.copy.expiration.secs 600
nimbus.cleanup.inbox.freq.secs 600
nimbus.childopts -Xmx1024m -Djava.net.preferIPv4Stack=true
logviewer.port 8000
logviewer.childopts -Xmx128m
logviewer.appender.name A1
java.library.path /usr/local/lib
drpc.worker.threads 64
drpc.request.timeout.secs 600
drpc.queue.size 128
drpc.port 3772
drpc.invocations.port 3773
drpc.childopts -Xmx768m
dev.zookeeper.path /tmp/dev-storm-zookeeper
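For what it's worth, my understanding is that a few of the topology-level values above correspond to the Java Config API roughly like this (just an illustration of how I read these settings, not my actual submission code):

import backtype.storm.Config;

// Illustration only: how I understand the topology-level settings listed above
// would be expressed through the Config API when building the topology.
public class NearlineConfSketch {
    public static Config buildConf() {
        Config conf = new Config();
        conf.setNumWorkers(4);          // topology.workers = 4
        conf.setNumAckers(4);           // topology.acker.executors = 4
        conf.setMessageTimeoutSecs(30); // topology.message.timeout.secs = 30
        conf.setDebug(false);           // topology.debug = false
        return conf;
    }
}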
I have several questions:
1. What is complete latency for a spout? From what I found online, complete latency is measured from when a tuple is emitted by the spout until that tuple is acked, so it covers the time for the entire tuple tree to be completed. As I understand it, the spout emits a tuple, the tuple is passed to the bolts and parsed and processed there, and then it is acked back to the spout. That would suggest my bolts are spending too much time processing tuples. But according to the data above, the bolts' execute latency and process latency are not high at all compared with the spout's complete latency, which contradicts my understanding. (I put a minimal sketch of how I think anchoring and acking work below the questions.)
2. What is the relationship between the complete latency and the failure rate? Also, what is the relationship between complete latency and total throughput? I notice the complete latency (over 100 seconds) is far above my topology.message.timeout.secs of 30, so I wonder whether the failures are simply timeouts.
3. Is there any hint on how I should deal with this high failure rate? (The only idea I have so far is the throttling sketch below, but I am not sure it is right.)
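Regarding question 1, this is my mental model of how a bolt is supposed to anchor and ack, as a minimal sketch (not my actual bolt; the field name is made up):

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Minimal pass-through bolt sketch. As I understand it, the spout's complete
// latency runs from the spout's emit until every tuple in the anchored tree
// (including the one emitted below) has been acked and the acker notifies the spout.
public class PassThroughBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Emit anchored to the input tuple so the new tuple joins the tuple tree.
        collector.emit(input, new Values(input.getValue(0)));
        // Ack the input tuple; the tree only completes once downstream tuples are acked too.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("message"));
    }
}

If that is right, complete latency would also include any time tuples spend waiting in buffers between the spout and the bolts, which could be why it is so much larger than the bolts' execute/process latencies, but I am not sure.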
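Regarding question 3, the only mitigation I have thought of so far is capping the number of in-flight tuples and/or raising the message timeout, roughly like below (the numbers are placeholders I picked, not values I have verified):

import backtype.storm.Config;

public class ThrottleSketch {
    // Guess at a mitigation (unverified): cap in-flight tuples so the spout
    // cannot flood the bolts, and give each tuple tree more time before the
    // timeout fails it.
    public static Config throttledConf() {
        Config conf = new Config();
        conf.setMaxSpoutPending(1000);   // topology.max.spout.pending (currently unset)
        conf.setMessageTimeoutSecs(120); // topology.message.timeout.secs (currently 30)
        return conf;
    }
}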
Thanks a lot for the help.