Complete latency can be higher if you have a lot of data pending from the spout (your topology.max.spout.pending is not set); however, given the low number of tuples, I wouldn't suspect that to be the case here. What is the layout of your topology? Is it circular in any way?
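
If you want to rule that out anyway, capping the in-flight tuples is a one-line config change. A minimal sketch, assuming a plain Java topology submitted with backtype.storm (the class name, the elided spout/bolt wiring, and the value 1000 are illustrative only, not a recommendation):

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;

    public class NearlineSubmit {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // ... setSpout/setBolt wiring elided ...

            Config conf = new Config();
            conf.setNumWorkers(4);
            // Cap tuples that are emitted but not yet acked or failed, per
            // spout task. Left unset, the spout emits as fast as it can and
            // the queued backlog inflates complete latency.
            conf.setMaxSpoutPending(1000); // illustrative value
            StormSubmitter.submitTopology("nearline", conf, builder.createTopology());
        }
    }
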
On Wed, Jul 30, 2014 at 12:11 PM, Wei, Xin <[email protected]> wrote:

> Hi There,
>
> I'm new to Storm. I set up a Storm cluster with 3 machines. I use one
> machine as the master and start ZooKeeper and the Nimbus server on it; I
> use the other 2 machines as supervisors. I am seeing a very high failure
> rate on my spout and I cannot figure out why. Please see the detailed
> statistics below.
>
> Topology stats
>
> Window       Emitted  Transferred  Complete latency (ms)  Acked  Failed
> 10m 0s       32       32           124856.430             7      6
> 3h 0m 0s     15780    15780        108286.484             2885   6435
> 1d 0h 0m 0s  15780    15780        108286.484             2885   6435
> All time     15780    15780        108286.484             2885   6435
>
> Spouts (All time)
>
> Id               Executors  Tasks  Emitted  Transferred  Complete latency (ms)  Acked  Failed  Last error
> JMS_QUEUE_SPOUT  2          2      9320     9320         108286.484             2885   6435
>
> Bolts (All time)
>
> Id                    Executors  Tasks  Emitted  Transferred  Capacity (last 10m)  Execute latency (ms)  Executed  Process latency (ms)  Acked  Failed  Last error
> AGGREGATOR_BOLT       8          8      3230     3230         0.000                61.529                3230      58.975                3230   0
> MESSAGEFILTER_BOLT    8          8      3230     3230         0.000                23.207                9320      18.054                9320   0
> OFFER_GENERATOR_BOLT  8          8      0        0            0.000                38.267                3230      34.693                3230   0
>
> For the spout, only 2885 tuples got acked and 6435 failed. The complete
> latency is horribly high.
>
> Here is my Storm configuration:
>
> Topology Configuration
>
> Key                                            Value
> zmq.threads                                    1
> zmq.linger.millis                              5000
> zmq.hwm                                        0
> worker.heartbeat.frequency.secs                1
> worker.childopts                               -Xmx768m -Djava.net.preferIPv4Stack=false -DNEARLINE_DATA_ENV=dev -DNEARLINE_APP_ENV=dev -DNEARLINE_QUEUES_ENV=dev -Dauthfilter.appcred.default.encrypt.file=/home/xwei/FP_AppCred_Encrypt.txt -Dauthfilter.appcred.default.passphrase.file=/home/xwei/FP_AppCred_Passphrase.txt
> ui.port                                        8080
> ui.childopts                                   -Xmx768m
> transactional.zookeeper.servers
> transactional.zookeeper.root                   /transactional
> transactional.zookeeper.port
> topology.workers                               4
> topology.worker.shared.thread.pool.size        4
> topology.worker.childopts
> topology.tuple.serializer                      backtype.storm.serialization.types.ListDelegateSerializer
> topology.trident.batch.emit.interval.millis    500
> topology.transfer.buffer.size                  32
> topology.tick.tuple.freq.secs
> topology.tasks
> topology.stats.sample.rate                     1
> topology.state.synchronization.timeout.secs    60
> topology.spout.wait.strategy                   backtype.storm.spout.SleepSpoutWaitStrategy
> topology.sleep.spout.wait.strategy.time.ms     1
> topology.skip.missing.kryo.registrations       false
> topology.receiver.buffer.size                  8
> topology.optimize                              true
> topology.name                                  nearline
> topology.message.timeout.secs                  30
> topology.max.task.parallelism
> topology.max.spout.pending
> topology.max.error.report.per.interval         5
> topology.kryo.register
> topology.kryo.factory                          backtype.storm.serialization.DefaultKryoFactory
> topology.kryo.decorators                       []
> topology.fall.back.on.java.serialization       true
> topology.executor.send.buffer.size             16384
> topology.executor.receive.buffer.size          16384
> topology.error.throttle.interval.secs          10
> topology.enable.message.timeouts               true
> topology.disruptor.wait.strategy               com.lmax.disruptor.BlockingWaitStrategy
> topology.debug                                 false
> topology.builtin.metrics.bucket.size.secs      60
> topology.acker.executors                       4
> task.refresh.poll.secs                         10
> task.heartbeat.frequency.secs                  3
> supervisor.worker.timeout.secs                 30
> supervisor.worker.start.timeout.secs           120
> supervisor.slots.ports                         [6700 6701 6702 6703]
> supervisor.monitor.frequency.secs              3
> supervisor.heartbeat.frequency.secs            5
> supervisor.enable                              true
> supervisor.childopts                           -Xmx256m -Djava.net.preferIPv4Stack=true
> storm.zookeeper.session.timeout                20000
> storm.zookeeper.servers                        ["zookeeper"]
> storm.zookeeper.root                           /storm
> storm.zookeeper.retry.times                    5
> storm.zookeeper.retry.intervalceiling.millis   30000
> storm.zookeeper.retry.interval                 1000
> storm.zookeeper.port                           2181
> storm.zookeeper.connection.timeout             15000
> storm.thrift.transport                         backtype.storm.security.auth.SimpleTransportPlugin
> storm.messaging.transport                      backtype.storm.messaging.zmq
> storm.messaging.netty.server_worker_threads    1
> storm.messaging.netty.min_wait_ms              100
> storm.messaging.netty.max_wait_ms              1000
> storm.messaging.netty.max_retries              30
> storm.messaging.netty.client_worker_threads    1
> storm.messaging.netty.buffer_size              5242880
> storm.local.mode.zmq                           false
> storm.local.dir                                /app_local/storm
> storm.id                                       nearline-1-1406570637
> storm.cluster.mode                             distributed
> nimbus.topology.validator                      backtype.storm.nimbus.DefaultTopologyValidator
> nimbus.thrift.port                             6627
> nimbus.task.timeout.secs                       30
> nimbus.task.launch.secs                        120
> nimbus.supervisor.timeout.secs                 60
> nimbus.reassign                                true
> nimbus.monitor.freq.secs                       10
> nimbus.inbox.jar.expiration.secs               3600
> nimbus.host                                    zookeeper
> nimbus.file.copy.expiration.secs               600
> nimbus.cleanup.inbox.freq.secs                 600
> nimbus.childopts                               -Xmx1024m -Djava.net.preferIPv4Stack=true
> logviewer.port                                 8000
> logviewer.childopts                            -Xmx128m
> logviewer.appender.name                        A1
> java.library.path                              /usr/local/lib
> drpc.worker.threads                            64
> drpc.request.timeout.secs                      600
> drpc.queue.size                                128
> drpc.port                                      3772
> drpc.invocations.port                          3773
> drpc.childopts                                 -Xmx768m
> dev.zookeeper.path                             /tmp/dev-storm-zookeeper
>
> I have several questions:
>
> 1. What is complete latency for a spout? According to what I found on
> Google, complete latency means *the timer is started when the tuple is
> emitted from the spout and it is stopped when the tuple is acked, so it
> measures the time for the entire tuple tree to be completed.* As far as
> I understand, the spout emits a tuple, the tuple is passed to the bolts
> to be parsed and processed there, and then it is acked back to the spout.
> This suggests it might be my bolts that spend too much time processing
> each tuple. But according to the data above, the bolts' execute latency
> and process latency are not high at all compared with the spout's
> complete latency, which contradicts my understanding.
>
> 2. What is the relationship between complete latency and the failure
> rate? Also, what is the relationship between complete latency and the
> total throughput?
>
> 3. Is there any hint on how I should deal with this high failure rate?
>
> Thanks a lot for the help.
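
On question 1: the definition you found is right, and the catch is that complete latency includes the time tuples spend queued between components, not just the time inside execute(), so it can dwarf the per-bolt execute/process latencies. For reference, a minimal sketch of the anchored-emit pattern that timer measures, assuming plain Java bolts (the class and field names are illustrative, not your actual code):

    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // BaseBasicBolt anchors every emit to the input tuple and acks the
    // input after execute() returns, so the spout's tuple tree (and its
    // complete-latency clock) only finishes once every downstream bolt
    // in the chain has acked its piece.
    public class MessageFilterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String message = input.getString(0);
            if (message != null && !message.isEmpty()) {
                collector.emit(new Values(message)); // anchored to input
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("message"));
        }
    }

On question 2: if a tree is not fully acked within topology.message.timeout.secs (30 in your config), Storm calls the spout's fail() for that tuple, so a complete latency far above the timeout and a high failure rate tend to go together.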
