Thanks for the info.  We/I haven't tapped into the metrics (yet?).  Glad you got 
your problem resolved.
    On Wednesday, November 18, 2020, 09:14:21 AM EST, Adam Honen 
<[email protected]> wrote:  
 
 Hi,
I've managed to resolve this, so it's probably best to share what the issue was 
in my case. As mentioned above, we have our own back pressure mechanism. It's all 
controlled from the spout, so I figured out (read: guessed) that we were probably 
hitting Storm's limit for the spout's queue.
After increasing topology.executor.receive.buffer.size further, so that it became 
larger than our own limit (50K in this case), the issue was resolved.
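For reference, the change is just a topology config entry. A minimal sketch (the
exact value here is an assumption; the only point is that it must exceed our 50K
spout-side limit):

    import org.apache.storm.Config;

    public class ReceiveBufferSketch {
        // Minimal sketch (not our actual topology code): raise the executor receive
        // queue above our own spout-side in-flight cap (50K) so Storm's internal
        // queue limit is not the first thing to fill up. 65536 is an assumed value.
        public static Config buildConf() {
            Config conf = new Config();
            conf.put("topology.executor.receive.buffer.size", 65536);
            return conf;
            // Then submit as usual, e.g.:
            // StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
        }
    }
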
Now, as for identifying this more easily next time, I see in the code that this 
configuration is read in WorkerState.mkReceiveQueueMap and passed to the 
constructor of JCQueue, where a metrics object is created. It looks like there are 
some really useful metrics reported there.
So for next time I plan on hooking up to these metrics (either via one of the 
built-in reporters, or via a new implementation better geared to our needs) 
and reporting some of them to our monitoring system. That should make 
troubleshooting such issues much simpler.
I haven't tested this part yet and it's not documented here: 
https://storm.apache.org/releases/2.2.0/ClusterMetrics.html, but hopefully it 
will still work.
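
For illustration only, here's roughly what I have in mind. This is an untested
sketch; the "topology.metrics.reporters" key and the ConsoleStormReporter class
name are assumptions taken from the metrics V2 docs and would need to be verified
against 2.2.0 (the reporter list may also need to go into storm.yaml instead of
the topology config):

    import org.apache.storm.Config;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    public class MetricsReporterSketch {
        // Untested sketch: attach a metrics V2 reporter so the JCQueue metrics get
        // reported somewhere visible. Key and class names are assumptions based on
        // the Storm 2.x metrics V2 documentation, not verified against 2.2.0.
        public static Config buildConf() {
            Map<String, Object> reporter = new HashMap<>();
            reporter.put("class", "org.apache.storm.metrics2.reporters.ConsoleStormReporter");
            reporter.put("report.period", 30);
            reporter.put("report.period.units", "SECONDS");

            Config conf = new Config();
            conf.put("topology.metrics.reporters", Arrays.asList(reporter));
            return conf;
        }
    }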


On Tue, Nov 17, 2020 at 4:08 PM Adam Honen <[email protected]> wrote:

Hi,
I'm wondering what sort of metrics, logs, or other indications I can use to 
understand why my topology gets stuck after upgrading from Storm 1.1.1 to 
Storm 2.2.0.



In more detail:
I have a 1.1.1 cluster with 40 workers processing ~400K events/second. It starts 
by reading from Kinesis via the AWS KCL, and this is also used to implement our 
own backpressure. That is, when the topology is overloaded with tuples, we stop 
reading from Kinesis until enough progress has been made (we've been able to 
checkpoint). After that, we resume reading.
However, with so many workers we don't really see backpressure being needed, 
even when dealing with much larger event rates.
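
Roughly, the spout side of this looks like the simplified sketch below. It is
illustrative only, not our actual code; the class name, the 50K cap, and the KCL
wiring (omitted) are all stand-ins:

    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    import java.util.Map;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicLong;

    // Simplified, illustrative sketch of the spout-side backpressure described above.
    // The KCL record processor that fills `pending` and the checkpointing logic are omitted.
    public class KinesisBackpressureSpoutSketch extends BaseRichSpout {
        private static final long MAX_IN_FLIGHT = 50_000;   // our own limit (example value)

        private final Queue<String> pending = new ConcurrentLinkedQueue<>(); // filled by the KCL worker
        private final AtomicLong inFlight = new AtomicLong();
        private SpoutOutputCollector collector;

        @Override
        public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            // ... start the KCL worker here; it pauses fetching while inFlight >= MAX_IN_FLIGHT
        }

        @Override
        public void nextTuple() {
            if (inFlight.get() >= MAX_IN_FLIGHT) {
                return; // backpressure: stop emitting until acks bring us back under the limit
            }
            String record = pending.poll();
            if (record != null) {
                inFlight.incrementAndGet();
                collector.emit(new Values(record), record); // record id doubles as the message id
            }
        }

        @Override
        public void ack(Object msgId) {
            inFlight.decrementAndGet();
            // ... checkpoint progress with the KCL once enough records have been acked
        }

        @Override
        public void fail(Object msgId) {
            inFlight.decrementAndGet();
            // ... re-enqueue, or let the KCL re-deliver
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("record"));
        }
    }
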
We've now created a similar cluster with Storm 2.2.0 and I've tried deploying 
our topology there. However, what happens is that within a couple of seconds, no 
more Kinesis records get read. The topology appears to just wait forever 
without processing anything.
I would like to troubleshoot this, but I'm not sure where to collect data 
from. My initial suspicion was that the new back pressure mechanism in Storm 2 
might have kicked in and that I need to configure it in order to resolve this 
issue. However, this is nothing more than a guess, and I'm not sure how I can 
actually prove or disprove it without lots of trial and error.
I've found some documentation about backpressure in the performance tuning 
chapter of the documentation, but it concentrates only on configuration 
parameters and doesn't explain how to really understand what's going on in a 
running topology.

  
