Hi there,
I'm new to Storm. I set up a Storm cluster with 3 machines: one machine is the master, where I run ZooKeeper and the Nimbus server, and the other two machines run supervisors. My spout shows a very high failure rate and I cannot figure out why. Please see the detailed statistics below.
Topology stats
Window         Emitted   Transferred   Complete latency (ms)   Acked   Failed
10m 0s         32        32            124856.430              7       6
3h 0m 0s       15780     15780         108286.484              2885    6435
1d 0h 0m 0s    15780     15780         108286.484              2885    6435
All time       15780     15780         108286.484              2885    6435
Spouts (All time)
Id                Executors   Tasks   Emitted   Transferred   Complete latency (ms)   Acked   Failed   Last error
JMS_QUEUE_SPOUT   2           2       9320      9320          108286.484              2885    6435
Bolts (All time)
Id                     Executors   Tasks   Emitted   Transferred   Capacity (last 10m)   Execute latency (ms)   Executed   Process latency (ms)   Acked   Failed   Last error
AGGREGATOR_BOLT        8           8       3230      3230          0.000                 61.529                 3230       58.975                 3230    0
MESSAGEFILTER_BOLT     8           8       3230      3230          0.000                 23.207                 9320       18.054                 9320    0
OFFER_GENERATOR_BOLT   8           8       0         0             0.000                 38.267                 3230       34.693                 3230    0
For the spout, only 2885 tuples were acked and 6435 failed. The complete latency is horribly high.
Here is my Storm configuration:
Topology Configuration
Key Value
zmq.threads 1
zmq.linger.millis 5000
zmq.hwm 0
worker.heartbeat.frequency.secs 1
worker.childopts -Xmx768m -Djava.net.preferIPv4Stack=false
-DNEARLINE_DATA_ENV=dev -DNEARLINE_APP_ENV=dev -DNEARLINE_QUEUES_ENV=dev
-Dauthfilter.appcred.default.encrypt.file=/home/xwei/FP_AppCred_Encrypt.txt
-Dauthfilter.appcred.default.passphrase.file=/home/xwei/FP_AppCred_Passphrase.txt
ui.port 8080
ui.childopts -Xmx768m
transactional.zookeeper.servers
transactional.zookeeper.root /transactional
transactional.zookeeper.port
topology.workers 4
topology.worker.shared.thread.pool.size 4
topology.worker.childopts
topology.tuple.serializer
backtype.storm.serialization.types.ListDelegateSerializer
topology.trident.batch.emit.interval.millis 500
topology.transfer.buffer.size 32
topology.tick.tuple.freq.secs
topology.tasks
topology.stats.sample.rate 1
topology.state.synchronization.timeout.secs 60
topology.spout.wait.strategy backtype.storm.spout.SleepSpoutWaitStrategy
topology.sleep.spout.wait.strategy.time.ms 1
topology.skip.missing.kryo.registrations false
topology.receiver.buffer.size 8
topology.optimize true
topology.name nearline
topology.message.timeout.secs 30
topology.max.task.parallelism
topology.max.spout.pending
topology.max.error.report.per.interval 5
topology.kryo.register
topology.kryo.factory backtype.storm.serialization.DefaultKryoFactory
topology.kryo.decorators []
topology.fall.back.on.java.serialization true
topology.executor.send.buffer.size 16384
topology.executor.receive.buffer.size 16384
topology.error.throttle.interval.secs 10
topology.enable.message.timeouts true
topology.disruptor.wait.strategy com.lmax.disruptor.BlockingWaitStrategy
topology.debug false
topology.builtin.metrics.bucket.size.secs 60
topology.acker.executors 4
task.refresh.poll.secs 10
task.heartbeat.frequency.secs 3
supervisor.worker.timeout.secs 30
supervisor.worker.start.timeout.secs 120
supervisor.slots.ports [6700 6701 6702 6703]
supervisor.monitor.frequency.secs 3
supervisor.heartbeat.frequency.secs 5
supervisor.enable true
supervisor.childopts -Xmx256m -Djava.net.preferIPv4Stack=true
storm.zookeeper.session.timeout 20000
storm.zookeeper.servers ["zookeeper"]
storm.zookeeper.root /storm
storm.zookeeper.retry.times 5
storm.zookeeper.retry.intervalceiling.millis 30000
storm.zookeeper.retry.interval 1000
storm.zookeeper.port 2181
storm.zookeeper.connection.timeout 15000
storm.thrift.transport backtype.storm.security.auth.SimpleTransportPlugin
storm.messaging.transport backtype.storm.messaging.zmq
storm.messaging.netty.server_worker_threads 1
storm.messaging.netty.min_wait_ms 100
storm.messaging.netty.max_wait_ms 1000
storm.messaging.netty.max_retries 30
storm.messaging.netty.client_worker_threads 1
storm.messaging.netty.buffer_size 5242880
storm.local.mode.zmq false
storm.local.dir /app_local/storm
storm.id nearline-1-1406570637
storm.cluster.mode distributed
nimbus.topology.validator backtype.storm.nimbus.DefaultTopologyValidator
nimbus.thrift.port 6627
nimbus.task.timeout.secs 30
nimbus.task.launch.secs 120
nimbus.supervisor.timeout.secs 60
nimbus.reassign true
nimbus.monitor.freq.secs 10
nimbus.inbox.jar.expiration.secs 3600
nimbus.host zookeeper
nimbus.file.copy.expiration.secs 600
nimbus.cleanup.inbox.freq.secs 600
nimbus.childopts -Xmx1024m -Djava.net.preferIPv4Stack=true
logviewer.port 8000
logviewer.childopts -Xmx128m
logviewer.appender.name A1
java.library.path /usr/local/lib
drpc.worker.threads 64
drpc.request.timeout.secs 600
drpc.queue.size 128
drpc.port 3772
drpc.invocations.port 3773
drpc.childopts -Xmx768m
dev.zookeeper.path /tmp/dev-storm-zookeeper
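For what it's worth, my understanding is that a few of the topology-level values above correspond to the Java Config API roughly like this (just an illustration of how I read these settings, not my actual submission code):

import backtype.storm.Config;

// Illustration only: how I understand the topology-level settings listed above
// would be expressed through the Config API when building the topology.
public class NearlineConfSketch {
    public static Config buildConf() {
        Config conf = new Config();
        conf.setNumWorkers(4);          // topology.workers = 4
        conf.setNumAckers(4);           // topology.acker.executors = 4
        conf.setMessageTimeoutSecs(30); // topology.message.timeout.secs = 30
        conf.setDebug(false);           // topology.debug = false
        return conf;
    }
}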
I have several questions:
1. What is complete latency for a spout? From what I found online, complete latency is measured from when a tuple is emitted by the spout until that tuple is acked, so it covers the time for the entire tuple tree to be completed. As I understand it, the spout emits a tuple, the tuple is passed to the bolts and parsed and processed there, and then it is acked back to the spout. That would suggest my bolts are spending too much time processing tuples. But according to the data above, the bolts' execute latency and process latency are not high at all compared with the spout's complete latency, which contradicts my understanding. (I put a minimal sketch of how I think anchoring and acking work below the questions.)
2. What is the relationship between the complete latency and the failure rate? Also, what is the relationship between complete latency and total throughput? I notice the complete latency (over 100 seconds) is far above my topology.message.timeout.secs of 30, so I wonder whether the failures are simply timeouts.
3. Is there any hint on how I should deal with this high failure rate? (The only idea I have so far is the throttling sketch below, but I am not sure it is right.)
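Regarding question 1, this is my mental model of how a bolt is supposed to anchor and ack, as a minimal sketch (not my actual bolt; the field name is made up):

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Minimal pass-through bolt sketch. As I understand it, the spout's complete
// latency runs from the spout's emit until every tuple in the anchored tree
// (including the one emitted below) has been acked and the acker notifies the spout.
public class PassThroughBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Emit anchored to the input tuple so the new tuple joins the tuple tree.
        collector.emit(input, new Values(input.getValue(0)));
        // Ack the input tuple; the tree only completes once downstream tuples are acked too.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("message"));
    }
}

If that is right, complete latency would also include any time tuples spend waiting in buffers between the spout and the bolts, which could be why it is so much larger than the bolts' execute/process latencies, but I am not sure.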
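Regarding question 3, the only mitigation I have thought of so far is capping the number of in-flight tuples and/or raising the message timeout, roughly like below (the numbers are placeholders I picked, not values I have verified):

import backtype.storm.Config;

public class ThrottleSketch {
    // Guess at a mitigation (unverified): cap in-flight tuples so the spout
    // cannot flood the bolts, and give each tuple tree more time before the
    // timeout fails it.
    public static Config throttledConf() {
        Config conf = new Config();
        conf.setMaxSpoutPending(1000);   // topology.max.spout.pending (currently unset)
        conf.setMessageTimeoutSecs(120); // topology.message.timeout.secs (currently 30)
        return conf;
    }
}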
Thanks a lot for the help.