There is no simple way to do this across an entire cluster. We do not report queue contention in the metrics that go to the UI, so it is not simple to aggregate it across more than one topology. It is, however, available in the topology-specific metrics.
http://storm.apache.org/releases/1.0.4/Metrics.html describes some of this. Sadly, not all of the metrics we support are listed there. I would suggest you install the logging metrics consumer and start looking around at what is reported (a minimal registration sketch is below, after the quoted message). The ones I think you want are __receive, __send, and __transport. Each of these is a queue metric. Some of the fields you should look at are population, overflow, and sojourn_time_ms. population is the number of entries in the queue itself. If the queue fills up there is an overflow, where there may be more entries. sojourn_time_ms is an estimate of the number of milliseconds it will take a tuple to flow through the queue. In my experience this is a very noisy number and is not always that accurate, but it can give you an idea of whether there is a problem (i.e. if the number is large).

- Bobby

On Wed, Oct 18, 2017 at 2:31 PM Tom Raney <[email protected]> wrote:

> Is there a good way to measure how contended a cluster is in terms of
> inbound/outbound queues?
>
> I'm using 1.0.2 and have noticed that at times tuples flowing through a
> topology slow down considerably.
>
> Load for each of the 5 nodes in the cluster is low and the network doesn't
> appear bottlenecked. Sometimes, if I redeploy or rebalance the topology,
> throughput increases dramatically for a day or so.
>
> I'm using topology.max.spout.pending set to 30 with 8 spouts feeding 40
> "writer" bolts. The capacity metric for the busiest bolt is around .780,
> which seems to indicate that they aren't the bottleneck.
>
> topology.message.timeout.secs is set to 120 seconds, but I'm not seeing
> failures.
>
> Additionally, I'm using tick tuples to flush the accumulated data at each
> bolt to the database every 5 minutes. Between those cycles, the bolt
> accumulates aggregated data and only writes if cache misses occur. But
> the cache hit rate is almost always 100%.
>
> -Tom
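
Here is roughly what registering the built-in logging metrics consumer looks like in Storm 1.x. The topology name and the spout/bolt wiring are placeholders, and the parallelism hint of 1 is just a reasonable default, not something from the thread above:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.metric.LoggingMetricsConsumer;
import org.apache.storm.topology.TopologyBuilder;

public class MetricsLoggingExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... set up your spouts and bolts here as usual ...

        Config conf = new Config();
        // Register the built-in logging consumer with a parallelism hint of 1.
        // The metrics the workers report, including the internal queue metrics
        // described above, are handed to this consumer and written to the
        // worker's metrics log file (typically metrics.log under the worker's
        // log directory with the default logging config).
        conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);

        StormSubmitter.submitTopology("queue-metrics-example", conf, builder.createTopology());
    }
}

If I remember right, the same thing can also be set up cluster-wide in storm.yaml via topology.metrics.consumer.register, but registering it per topology in code keeps the noise down while you are investigating.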
