I took thread dumps of the worker where the graphite consumer bolt executor
was running, but I didn't see any BLOCKED threads or anything out of the
ordinary.  This is the thread dump for the graphite metrics consumer bolt:

"Thread-23-__metricscom.verisign.storm.metrics.GraphiteMetricsConsumer" #56
prio=5 os_prio=0 tid=0x00007f0b8555c800 nid=0x9a2 waiting on condition
[0x00007f0abaeed000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at
backtype.storm.daemon.executor$fn__5694$fn__5707.invoke(executor.clj:713)
        at backtype.storm.util$async_loop$fn__545.invoke(util.clj:477)
        at clojure.lang.AFn.run(AFn.java:22)
        at java.lang.Thread.run(Thread.java:745)

Would a "stuck" bolt on some other worker JVM have the same effect?


On Fri, Apr 15, 2016 at 2:10 PM, Abhishek Agarwal <[email protected]>
wrote:

> You might want to check the thread dump and verify if some bolt is stuck
> somewhere
>
> Excuse typos
> On Apr 15, 2016 11:08 PM, "Kevin Conaway" <[email protected]>
> wrote:
>
>> Was the bolt really "stuck" though given that the failure was at the
>> spout level (because the spout couldn't connect to the Kafka broker)?
>>
>> Additionally, we restarted the Kafka broker and it seemed like the spout
>> was able to reconnect, but we never saw messages come through on the
>> metrics consumer until we killed and restarted the topology.
>>
>> On Fri, Apr 15, 2016 at 1:31 PM, Abhishek Agarwal <[email protected]>
>> wrote:
>>
>>> Kevin,
>>> That would explain it. A stuck bolt will stall the whole topology.
>>> The MetricsConsumer runs as a bolt, so it will be blocked as well.
>>>
>>> Excuse typos
>>> On Apr 15, 2016 10:29 PM, "Kevin Conaway" <[email protected]>
>>> wrote:
>>>
>>>> Two more data points on this:
>>>>
>>>> 1.) We are registering the graphite MetricsConsumer on our topology
>>>> Config, not globally in storm.yaml (see the sketch after this list).  I
>>>> don't know if this makes a difference.
>>>>
>>>> 2.) We re-ran another test last night and it ran fine for about 6 hours
>>>> until the Kafka brokers ran out of disk space (oops), which halted the
>>>> test.  That time coincided exactly with when the Graphite instance
>>>> stopped receiving metrics from Storm.  Given that we weren't processing
>>>> any tuples while Kafka was down, I understand why we didn't get those
>>>> metrics, but shouldn't the __system metrics (like heap size, GC time)
>>>> still have been sent?
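>>>>
>>>> Regarding point 1.), the registration on the topology Config looks
>>>> roughly like this (a minimal sketch; the Graphite host/port argument
>>>> keys and values are illustrative assumptions, not copied from our code):
>>>>
>>>> import java.util.HashMap;
>>>> import java.util.Map;
>>>>
>>>> import backtype.storm.Config;
>>>>
>>>> import com.verisign.storm.metrics.GraphiteMetricsConsumer;
>>>>
>>>> public final class MetricsSetup {
>>>>     // Builds the topology Config with the Graphite consumer registered on it
>>>>     public static Config buildConfig() {
>>>>         Config conf = new Config();
>>>>
>>>>         // Arguments handed to the consumer; key names/values are assumptions
>>>>         Map<String, Object> args = new HashMap<String, Object>();
>>>>         args.put("metrics.graphite.host", "graphite.example.com");
>>>>         args.put("metrics.graphite.port", "2003");
>>>>
>>>>         // One consumer executor; metrics arrive on the built-in 60s bucket
>>>>         conf.registerMetricsConsumer(GraphiteMetricsConsumer.class, args, 1);
>>>>         conf.put(Config.TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS, 60);
>>>>         return conf;
>>>>     }
>>>> }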
>>>>
>>>> On Thu, Apr 14, 2016 at 10:09 PM, Kevin Conaway <
>>>> [email protected]> wrote:
>>>>
>>>>> Thank you for taking the time to respond.
>>>>>
>>>>> In my bolt I am registering 3 custom metrics (each a ReducedMetric to
>>>>> track the latency of individual operations in the bolt).  The metric
>>>>> interval for each is the same as TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS,
>>>>> which we have set to 60s.
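>>>>>
>>>>> For reference, the registration looks roughly like this (a minimal
>>>>> sketch; the bolt class name, metric name, and use of MeanReducer are
>>>>> illustrative, not our exact code):
>>>>>
>>>>> import java.util.Map;
>>>>>
>>>>> import backtype.storm.metric.api.MeanReducer;
>>>>> import backtype.storm.metric.api.ReducedMetric;
>>>>> import backtype.storm.task.OutputCollector;
>>>>> import backtype.storm.task.TopologyContext;
>>>>> import backtype.storm.topology.OutputFieldsDeclarer;
>>>>> import backtype.storm.topology.base.BaseRichBolt;
>>>>> import backtype.storm.tuple.Tuple;
>>>>>
>>>>> public class CassandraWriterBolt extends BaseRichBolt {
>>>>>     private transient ReducedMetric insertLatencyMs;
>>>>>     private OutputCollector collector;
>>>>>
>>>>>     @Override
>>>>>     public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
>>>>>         this.collector = collector;
>>>>>         // Same 60s bucket as TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS
>>>>>         this.insertLatencyMs = context.registerMetric(
>>>>>                 "cassandra-insert-latency-ms", new ReducedMetric(new MeanReducer()), 60);
>>>>>     }
>>>>>
>>>>>     @Override
>>>>>     public void execute(Tuple tuple) {
>>>>>         long start = System.currentTimeMillis();
>>>>>         // ... transform the message and insert it into Cassandra ...
>>>>>         insertLatencyMs.update(System.currentTimeMillis() - start);
>>>>>         collector.ack(tuple);
>>>>>     }
>>>>>
>>>>>     @Override
>>>>>     public void declareOutputFields(OutputFieldsDeclarer declarer) {
>>>>>         // terminal bolt, no output streams
>>>>>     }
>>>>> }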
>>>>>
>>>>> The topology did not hang completely, but it did degrade severely.
>>>>> Without metrics it was hard to tell, but it looked like some of the tasks
>>>>> for certain Kafka partitions either stopped emitting tuples or never got
>>>>> acknowledgements for the tuples they did emit.  Some tuples were definitely
>>>>> making it through, though, because data was continuously being inserted
>>>>> into Cassandra.  After I killed and resubmitted the topology, there were
>>>>> still messages left over in the topic, but only for certain partitions.
>>>>>
>>>>> What queue configuration are you looking for?
>>>>>
>>>>> I don't believe the issue was that the graphite metrics consumer wasn't
>>>>> "keeping up".  In Storm UI, the processing latency was very low for that
>>>>> pseudo-bolt, as was the capacity.  Storm UI just showed that no tuples
>>>>> were being delivered to the bolt.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Thu, Apr 14, 2016 at 9:00 PM, Jungtaek Lim <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Kevin,
>>>>>>
>>>>>> Do you register custom metrics? If so, how long are their intervals,
>>>>>> and do they vary?
>>>>>> Did your topology stop working completely? (I mean, did all tuples
>>>>>> start failing after that time?)
>>>>>> And could you share your queue configuration?
>>>>>>
>>>>>> You could also replace storm-graphite with LoggingMetricsConsumer and
>>>>>> see if it helps. If changing the consumer resolves the issue, we can
>>>>>> guess that storm-graphite cannot keep up with the metrics.
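>>>>>>
>>>>>> For example, something like this on the topology Config (just a
>>>>>> sketch; a parallelism hint of 1 is only for illustration):
>>>>>>
>>>>>> import backtype.storm.Config;
>>>>>> import backtype.storm.metric.LoggingMetricsConsumer;
>>>>>>
>>>>>> Config conf = new Config();
>>>>>> // Built-in consumer that writes each metrics data point to the worker's metrics log
>>>>>> conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);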
>>>>>>
>>>>>> Btw, I'm addressing metrics consumer issues (asynchronous processing,
>>>>>> filtering). You can track the progress here:
>>>>>> https://issues.apache.org/jira/browse/STORM-1699
>>>>>>
>>>>>> I'm afraid they may not be ported to 0.10.x, but the asynchronous
>>>>>> metrics consumer bolt
>>>>>> <https://issues.apache.org/jira/browse/STORM-1698> is a simple patch,
>>>>>> so you can apply it to a custom 0.10.0 build and give it a try.
>>>>>>
>>>>>> Hope this helps.
>>>>>>
>>>>>> Thanks,
>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 14, 2016 at 11:06 PM, Denis DEBARBIEUX <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Kevin,
>>>>>>>
>>>>>>> I have a similar issue with Storm 0.9.6 (see the following thread:
>>>>>>> https://mail-archives.apache.org/mod_mbox/storm-user/201603.mbox/browser
>>>>>>> ).
>>>>>>>
>>>>>>> It is still open. So, please, keep me informed on your progress.
>>>>>>>
>>>>>>> Denis
>>>>>>>
>>>>>>>
>>>>>>> On 14/04/2016 at 15:54, Kevin Conaway wrote:
>>>>>>>
>>>>>>> We are using Storm 0.10 with the following configuration:
>>>>>>>
>>>>>>>    - 1 Nimbus node
>>>>>>>    - 6 Supervisor nodes, each with 2 worker slots.  Each supervisor
>>>>>>>    has 8 cores.
>>>>>>>
>>>>>>>
>>>>>>> Our topology has a KafkaSpout that forwards to a bolt where we
>>>>>>> transform the message and insert it into Cassandra.  Our topic has 50
>>>>>>> partitions, so we have configured the number of executors/tasks for the
>>>>>>> KafkaSpout to be 50.  Our bolt has 150 executors/tasks.
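>>>>>>>
>>>>>>> The wiring looks roughly like this (a minimal sketch; the ZooKeeper
>>>>>>> address, topic, component names, worker count, and the bolt class are
>>>>>>> illustrative placeholders, not our exact code):
>>>>>>>
>>>>>>> import backtype.storm.Config;
>>>>>>> import backtype.storm.StormSubmitter;
>>>>>>> import backtype.storm.topology.TopologyBuilder;
>>>>>>>
>>>>>>> import storm.kafka.KafkaSpout;
>>>>>>> import storm.kafka.SpoutConfig;
>>>>>>> import storm.kafka.ZkHosts;
>>>>>>>
>>>>>>> public final class IngestTopology {
>>>>>>>     public static void main(String[] args) throws Exception {
>>>>>>>         TopologyBuilder builder = new TopologyBuilder();
>>>>>>>
>>>>>>>         // 50 spout executors, one per Kafka partition
>>>>>>>         SpoutConfig spoutConfig = new SpoutConfig(
>>>>>>>                 new ZkHosts("zk1:2181"), "events", "/kafka-spout", "events-spout");
>>>>>>>         builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 50);
>>>>>>>
>>>>>>>         // 150 bolt executors doing the transform + Cassandra insert
>>>>>>>         builder.setBolt("cassandra-writer", new CassandraWriterBolt(), 150)
>>>>>>>                .shuffleGrouping("kafka-spout");
>>>>>>>
>>>>>>>         Config conf = new Config();
>>>>>>>         conf.setNumWorkers(12); // 6 supervisors x 2 worker slots
>>>>>>>         StormSubmitter.submitTopology("ingest-topology", conf, builder.createTopology());
>>>>>>>     }
>>>>>>> }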
>>>>>>>
>>>>>>> We have also added the storm-graphite metrics consumer
>>>>>>> (https://github.com/verisign/storm-graphite) to our topology so that
>>>>>>> Storm's metrics are sent to our Graphite cluster.
>>>>>>>
>>>>>>> Yesterday we were running a 2000 tuple/sec load test and everything
>>>>>>> was fine for a few hours until we noticed that we were no longer
>>>>>>> receiving metrics from Storm in Graphite.
>>>>>>>
>>>>>>> I verified that it's not a connectivity issue between Storm and
>>>>>>> Graphite.  Looking in Storm UI,
>>>>>>> the __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer hadn't
>>>>>>> received a single tuple in the prior 10-minute or 3-hour window.
>>>>>>>
>>>>>>> Since the metrics consumer bolt was assigned to one executor, I took
>>>>>>> thread dumps of that JVM.  I saw the following stack trace for the 
>>>>>>> metrics
>>>>>>> consumer thread:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>


-- 
Kevin Conaway
http://www.linkedin.com/pub/kevin-conaway/7/107/580/
https://github.com/kevinconaway
