Complete latency can be higher if you have a lot of data pending from the spout (your topology.max.spout.pending is not set); however, given the low number of tuples, I wouldn't suspect that to be the case here. What is the layout of your topology? Is it circular in any way?
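
If you want to rule that out anyway, capping the in-flight tuples is a one-line config change. A minimal sketch, assuming a plain Java topology submitted with backtype.storm (the class name, the elided spout/bolt wiring, and the value 1000 are illustrative only, not a recommendation):

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;

    public class NearlineSubmit {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // ... setSpout/setBolt wiring elided ...

            Config conf = new Config();
            conf.setNumWorkers(4);
            // Cap tuples that are emitted but not yet acked or failed, per
            // spout task. Left unset, the spout emits as fast as it can and
            // the queued backlog inflates complete latency.
            conf.setMaxSpoutPending(1000); // illustrative value
            StormSubmitter.submitTopology("nearline", conf, builder.createTopology());
        }
    }
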
On Wed, Jul 30, 2014 at 12:11 PM, Wei, Xin <[email protected]> wrote:

> Hi There,
>
> I'm new to Storm. I set up a Storm cluster with 3 machines. I use one
> machine as the master and start ZooKeeper and the Nimbus server on it; I
> use the other 2 machines as supervisors. I am seeing a very high failure
> rate on my spout and I cannot figure out why. Please see the detailed
> statistics below.
>
> Topology stats
>
> Window       Emitted  Transferred  Complete latency (ms)  Acked  Failed
> 10m 0s       32       32           124856.430             7      6
> 3h 0m 0s     15780    15780        108286.484             2885   6435
> 1d 0h 0m 0s  15780    15780        108286.484             2885   6435
> All time     15780    15780        108286.484             2885   6435
>
> Spouts (All time)
>
> Id               Executors  Tasks  Emitted  Transferred  Complete latency (ms)  Acked  Failed  Last error
> JMS_QUEUE_SPOUT  2          2      9320     9320         108286.484             2885   6435
>
> Bolts (All time)
>
> Id                    Executors  Tasks  Emitted  Transferred  Capacity (last 10m)  Execute latency (ms)  Executed  Process latency (ms)  Acked  Failed  Last error
> AGGREGATOR_BOLT       8          8      3230     3230         0.000                61.529                3230      58.975                3230   0
> MESSAGEFILTER_BOLT    8          8      3230     3230         0.000                23.207                9320      18.054                9320   0
> OFFER_GENERATOR_BOLT  8          8      0        0            0.000                38.267                3230      34.693                3230   0
>
> For the spout, only 2885 tuples got acked and 6435 failed. The complete
> latency is horribly high.
>
> Here is my Storm configuration:
>
> Topology Configuration
>
> Key                                            Value
> zmq.threads                                    1
> zmq.linger.millis                              5000
> zmq.hwm                                        0
> worker.heartbeat.frequency.secs                1
> worker.childopts                               -Xmx768m -Djava.net.preferIPv4Stack=false -DNEARLINE_DATA_ENV=dev -DNEARLINE_APP_ENV=dev -DNEARLINE_QUEUES_ENV=dev -Dauthfilter.appcred.default.encrypt.file=/home/xwei/FP_AppCred_Encrypt.txt -Dauthfilter.appcred.default.passphrase.file=/home/xwei/FP_AppCred_Passphrase.txt
> ui.port                                        8080
> ui.childopts                                   -Xmx768m
> transactional.zookeeper.servers
> transactional.zookeeper.root                   /transactional
> transactional.zookeeper.port
> topology.workers                               4
> topology.worker.shared.thread.pool.size        4
> topology.worker.childopts
> topology.tuple.serializer                      backtype.storm.serialization.types.ListDelegateSerializer
> topology.trident.batch.emit.interval.millis    500
> topology.transfer.buffer.size                  32
> topology.tick.tuple.freq.secs
> topology.tasks
> topology.stats.sample.rate                     1
> topology.state.synchronization.timeout.secs    60
> topology.spout.wait.strategy                   backtype.storm.spout.SleepSpoutWaitStrategy
> topology.sleep.spout.wait.strategy.time.ms     1
> topology.skip.missing.kryo.registrations       false
> topology.receiver.buffer.size                  8
> topology.optimize                              true
> topology.name                                  nearline
> topology.message.timeout.secs                  30
> topology.max.task.parallelism
> topology.max.spout.pending
> topology.max.error.report.per.interval         5
> topology.kryo.register
> topology.kryo.factory                          backtype.storm.serialization.DefaultKryoFactory
> topology.kryo.decorators                       []
> topology.fall.back.on.java.serialization       true
> topology.executor.send.buffer.size             16384
> topology.executor.receive.buffer.size          16384
> topology.error.throttle.interval.secs          10
> topology.enable.message.timeouts               true
> topology.disruptor.wait.strategy               com.lmax.disruptor.BlockingWaitStrategy
> topology.debug                                 false
> topology.builtin.metrics.bucket.size.secs      60
> topology.acker.executors                       4
> task.refresh.poll.secs                         10
> task.heartbeat.frequency.secs                  3
> supervisor.worker.timeout.secs                 30
> supervisor.worker.start.timeout.secs           120
> supervisor.slots.ports                         [6700 6701 6702 6703]
> supervisor.monitor.frequency.secs              3
> supervisor.heartbeat.frequency.secs            5
> supervisor.enable                              true
> supervisor.childopts                           -Xmx256m -Djava.net.preferIPv4Stack=true
> storm.zookeeper.session.timeout                20000
> storm.zookeeper.servers                        ["zookeeper"]
> storm.zookeeper.root                           /storm
> storm.zookeeper.retry.times                    5
> storm.zookeeper.retry.intervalceiling.millis   30000
> storm.zookeeper.retry.interval                 1000
> storm.zookeeper.port                           2181
> storm.zookeeper.connection.timeout             15000
> storm.thrift.transport                         backtype.storm.security.auth.SimpleTransportPlugin
> storm.messaging.transport                      backtype.storm.messaging.zmq
> storm.messaging.netty.server_worker_threads    1
> storm.messaging.netty.min_wait_ms              100
> storm.messaging.netty.max_wait_ms              1000
> storm.messaging.netty.max_retries              30
> storm.messaging.netty.client_worker_threads    1
> storm.messaging.netty.buffer_size              5242880
> storm.local.mode.zmq                           false
> storm.local.dir                                /app_local/storm
> storm.id                                       nearline-1-1406570637
> storm.cluster.mode                             distributed
> nimbus.topology.validator                      backtype.storm.nimbus.DefaultTopologyValidator
> nimbus.thrift.port                             6627
> nimbus.task.timeout.secs                       30
> nimbus.task.launch.secs                        120
> nimbus.supervisor.timeout.secs                 60
> nimbus.reassign                                true
> nimbus.monitor.freq.secs                       10
> nimbus.inbox.jar.expiration.secs               3600
> nimbus.host                                    zookeeper
> nimbus.file.copy.expiration.secs               600
> nimbus.cleanup.inbox.freq.secs                 600
> nimbus.childopts                               -Xmx1024m -Djava.net.preferIPv4Stack=true
> logviewer.port                                 8000
> logviewer.childopts                            -Xmx128m
> logviewer.appender.name                        A1
> java.library.path                              /usr/local/lib
> drpc.worker.threads                            64
> drpc.request.timeout.secs                      600
> drpc.queue.size                                128
> drpc.port                                      3772
> drpc.invocations.port                          3773
> drpc.childopts                                 -Xmx768m
> dev.zookeeper.path                             /tmp/dev-storm-zookeeper
>
> I have several questions:
>
> 1. What is complete latency for a spout? According to what I found on
> Google, complete latency means *the timer is started when the tuple is
> emitted from the spout and it is stopped when the tuple is acked, so it
> measures the time for the entire tuple tree to be completed.* As far as
> I understand, the spout emits a tuple, the tuple is passed to the bolts
> to be parsed and processed there, and then it is acked back to the spout.
> This suggests it might be my bolts that spend too much time processing
> each tuple. But according to the data above, the bolts' execute latency
> and process latency are not high at all compared with the spout's
> complete latency, which contradicts my understanding.
>
> 2. What is the relationship between complete latency and the failure
> rate? Also, what is the relationship between complete latency and the
> total throughput?
>
> 3. Is there any hint on how I should deal with this high failure rate?
>
> Thanks a lot for the help.
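
On question 1: the definition you found is right, and the catch is that complete latency includes the time tuples spend queued between components, not just the time inside execute(), so it can dwarf the per-bolt execute/process latencies. For reference, a minimal sketch of the anchored-emit pattern that timer measures, assuming plain Java bolts (the class and field names are illustrative, not your actual code):

    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // BaseBasicBolt anchors every emit to the input tuple and acks the
    // input after execute() returns, so the spout's tuple tree (and its
    // complete-latency clock) only finishes once every downstream bolt
    // in the chain has acked its piece.
    public class MessageFilterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String message = input.getString(0);
            if (message != null && !message.isEmpty()) {
                collector.emit(new Values(message)); // anchored to input
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("message"));
        }
    }

On question 2: if a tree is not fully acked within topology.message.timeout.secs (30 in your config), Storm calls the spout's fail() for that tuple, so a complete latency far above the timeout and a high failure rate tend to go together.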
