Kevin,

That would explain it. A stuck bolt will stall the whole topology. MetricsConsumer runs as a bolt, so it will be blocked as well.
Excuse typos

On Apr 15, 2016 10:29 PM, "Kevin Conaway" <[email protected]> wrote:

> Two more data points on this:
>
> 1.) We are registering the graphite MetricsConsumer on our Topology Config, not globally in storm.yaml. I don't know if this makes a difference.
>
> 2.) We re-ran another test last night and it ran fine for about 6 hours until the Kafka brokers ran out of disk space (oops), which halted the test. This exact time also coincided with when the Graphite instance stopped receiving metrics from Storm. Given that we weren't processing any tuples while Storm was down, I understand why we didn't get those metrics, but shouldn't the __system metrics (like heap size and GC time) still have been sent?
>
> On Thu, Apr 14, 2016 at 10:09 PM, Kevin Conaway <[email protected]> wrote:
>
>> Thank you for taking the time to respond.
>>
>> In my bolt I am registering 3 custom metrics (each a ReducedMetric to track the latency of individual operations in the bolt). The metric interval for each is the same as TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS, which we have set to 60s.
>>
>> The topology did not hang completely, but it did degrade severely. Without metrics it was hard to tell, but it looked like some of the tasks for certain Kafka partitions either stopped emitting tuples or never got acknowledgements for the tuples they did emit. Some tuples were definitely making it through, though, because data was continuously being inserted into Cassandra. After I killed and resubmitted the topology, there were still messages left over in the topic, but only for certain partitions.
>>
>> What queue configuration are you looking for?
>>
>> I don't believe it was a case of the graphite metrics consumer not "keeping up". In Storm UI, the processing latency was very low for that pseudo-bolt, as was the capacity. Storm UI just showed that no tuples were being delivered to the bolt.
>>
>> Thanks!
>>
>> On Thu, Apr 14, 2016 at 9:00 PM, Jungtaek Lim <[email protected]> wrote:
>>
>>> Kevin,
>>>
>>> Do you register custom metrics? If so, how long are their intervals, and do they vary?
>>> Did your topology stop working completely? (I mean, did all tuples start failing after that time?)
>>> And could you share your queue configuration?
>>>
>>> You could also replace storm-graphite with LoggingMetricsConsumer and see if that helps. If changing the consumer resolves the issue, we can guess that storm-graphite cannot keep up with the metrics.
>>>
>>> Btw, I'm addressing metrics consumer issues (asynchronous, filter). You can track the progress here:
>>> https://issues.apache.org/jira/browse/STORM-1699
>>>
>>> I'm afraid they may not be ported to 0.10.x, but the asynchronous metrics consumer bolt <https://issues.apache.org/jira/browse/STORM-1698> is a simple patch, so you can apply it, build a custom 0.10.0, and give it a try.
>>>
>>> Hope this helps.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>>
>>> On Thu, Apr 14, 2016 at 11:06 PM, Denis DEBARBIEUX <[email protected]> wrote:
>>>
>>>> Hi Kevin,
>>>>
>>>> I have a similar issue with Storm 0.9.6 (see the following topic: https://mail-archives.apache.org/mod_mbox/storm-user/201603.mbox/browser).
>>>>
>>>> It is still open. So, please, keep me informed of your progress.
>>>>
>>>> Denis
>>>>
>>>>
>>>> On 14/04/2016 at 15:54, Kevin Conaway wrote:
>>>>
>>>> We are using Storm 0.10 with the following configuration:
>>>>
>>>> - 1 Nimbus node
>>>> - 6 Supervisor nodes, each with 2 worker slots. Each supervisor has 8 cores.
>>>>
>>>> Our topology has a KafkaSpout that forwards to a bolt where we transform the message and insert it into Cassandra. Our topic has 50 partitions, so we have configured the number of executors/tasks for the KafkaSpout to be 50. Our bolt has 150 executors/tasks.
>>>>
>>>> We have also added the storm-graphite metrics consumer (https://github.com/verisign/storm-graphite) to our topology so that Storm's metrics are sent to our Graphite cluster.
>>>>
>>>> Yesterday we were running a 2000 tuple/sec load test and everything was fine for a few hours until we noticed that we were no longer receiving metrics from Storm in Graphite.
>>>>
>>>> I verified that it's not a connectivity issue between Storm and Graphite. Looking in Storm UI, the __metricscom.verisign.storm.metrics.GraphiteMetricsConsumer hadn't received a single tuple in the prior 10-minute or 3-hour window.
>>>>
>>>> Since the metrics consumer bolt was assigned to one executor, I took thread dumps of that JVM. I saw the following stack trace for the metrics consumer thread:
>>>>
>>
>> --
>> Kevin Conaway
>> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
>> https://github.com/kevinconaway
>
> --
> Kevin Conaway
> http://www.linkedin.com/pub/kevin-conaway/7/107/580/
> https://github.com/kevinconaway
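
For reference, the setup described in the thread (a KafkaSpout with 50 executors feeding a transform-and-insert bolt with 150 executors, plus the Graphite metrics consumer registered on the topology Config rather than in storm.yaml) would look roughly like the sketch below, against the Storm 0.10 backtype.storm API. The ZooKeeper address, topic name, Graphite argument keys and the TransformAndInsertBolt class are placeholders, not details taken from the thread.

    import java.util.HashMap;
    import java.util.Map;

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;
    import com.verisign.storm.metrics.GraphiteMetricsConsumer;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.ZkHosts;

    public class LoadTestTopology {
        public static void main(String[] args) throws Exception {
            // Spout parallelism matches the 50 topic partitions; the bolt runs 150 executors/tasks.
            SpoutConfig spoutConfig = new SpoutConfig(
                    new ZkHosts("zk1.example.com:2181"), "events", "/kafka-spout", "load-test");

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 50).setNumTasks(50);
            builder.setBolt("transform-insert-bolt", new TransformAndInsertBolt(), 150)
                   .setNumTasks(150)
                   .shuffleGrouping("kafka-spout");

            Config conf = new Config();
            conf.put(Config.TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS, 60);

            // Register the metrics consumer on the topology Config instead of globally in
            // storm.yaml. The argument keys below are illustrative; check the storm-graphite
            // README for the exact names.
            Map<String, String> graphiteArgs = new HashMap<String, String>();
            graphiteArgs.put("metrics.graphite.host", "graphite.example.com");
            graphiteArgs.put("metrics.graphite.port", "2003");
            conf.registerMetricsConsumer(GraphiteMetricsConsumer.class, graphiteArgs, 1);

            // To rule out the consumer itself, as suggested in the thread, swap in the
            // built-in logger instead:
            // conf.registerMetricsConsumer(backtype.storm.metric.LoggingMetricsConsumer.class, 1);

            StormSubmitter.submitTopology("load-test", conf, builder.createTopology());
        }
    }

Whichever consumer is registered, Storm wires it in as the __metrics... pseudo-bolt visible in Storm UI, with the given parallelism hint, which is why the stuck-bolt explanation at the top of the thread applies to it like any other bolt.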
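
Kevin's custom metrics (each a ReducedMetric on a 60-second bucket) would be registered in the bolt's prepare() method along these lines. This is only a sketch of the pattern, reusing the hypothetical TransformAndInsertBolt name from the previous snippet; the metric name and the elided Cassandra write are not from the thread.

    import java.util.Map;

    import backtype.storm.metric.api.MeanReducer;
    import backtype.storm.metric.api.ReducedMetric;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;

    public class TransformAndInsertBolt extends BaseRichBolt {
        private transient OutputCollector collector;
        private transient ReducedMetric transformLatency;

        @Override
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            // 60s bucket, matching TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS in the thread.
            // Kevin registers three of these; one is enough to show the shape.
            this.transformLatency = context.registerMetric(
                    "transform-latency-ms", new ReducedMetric(new MeanReducer()), 60);
        }

        @Override
        public void execute(Tuple tuple) {
            long start = System.currentTimeMillis();
            // ... transform the message and insert it into Cassandra (omitted) ...
            transformLatency.update(System.currentTimeMillis() - start);
            collector.ack(tuple);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt: nothing to declare.
        }
    }

Metrics registered this way are flushed once per bucket onto the same metrics stream as the built-in metrics, so they stop reaching Graphite as soon as the metrics consumer bolt stops receiving tuples.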
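
On the question of whether the __system metrics (heap size, GC time) should still have been sent: they are delivered through the same IMetricsConsumer.handleDataPoints() callback, which runs inside the __metrics pseudo-bolt's executor, so a wedged consumer bolt stops them too. A minimal illustrative consumer, purely hypothetical and not the storm-graphite implementation, that only prints __system data points could look like this:

    import java.util.Collection;
    import java.util.Map;

    import backtype.storm.metric.api.IMetricsConsumer;
    import backtype.storm.task.IErrorReporter;
    import backtype.storm.task.TopologyContext;

    // Prints only data points coming from the __system component (heap size,
    // GC time, etc.), the metrics discussed in the thread.
    public class SystemMetricsLogger implements IMetricsConsumer {
        @Override
        public void prepare(Map stormConf, Object registrationArgument,
                            TopologyContext context, IErrorReporter errorReporter) {
        }

        @Override
        public void handleDataPoints(TaskInfo taskInfo, Collection<DataPoint> dataPoints) {
            // This method runs inside the metrics consumer pseudo-bolt's executor, so if
            // that bolt stops receiving tuples (as in the thread), nothing is ever printed.
            if ("__system".equals(taskInfo.srcComponentId)) {
                for (DataPoint p : dataPoints) {
                    System.out.printf("%s:%d %s = %s%n",
                            taskInfo.srcWorkerHost, taskInfo.srcWorkerPort, p.name, p.value);
                }
            }
        }

        @Override
        public void cleanup() {
        }
    }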
