We haven’t rolled out any code changes.
We have the same code base running in a different environment at a different 
scale, and it works fine.

The current environment has a higher load.
As you wrote, the DEBUG log didn’t shed any light as to the reason for the 
failures.

I read somewhere that a failed tuple is replayed until it is acked.
Is it possible that we have some failing events that loop infinitely because 
they keep failing?
If so, is there an easy way to check that, short of adding a cache and metrics 
and implementing our own mechanism to count unique failures?
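The simplest thing we could think of is wrapping our spout in a thin delegating 
spout that only logs fail() calls, so a message id that keeps reappearing would 
point at a looping tuple. Something along these lines (untested sketch; 
FailLoggingSpout is just a name we made up):

    import java.util.Map;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.IRichSpout;
    import backtype.storm.topology.OutputFieldsDeclarer;

    // Delegates everything to the real spout and only logs fail() calls.
    public class FailLoggingSpout implements IRichSpout {
        private final IRichSpout delegate;

        public FailLoggingSpout(IRichSpout delegate) {
            this.delegate = delegate;
        }

        @Override
        public void fail(Object msgId) {
            // A message id that shows up here over and over is being replayed in a loop.
            System.err.println("Tuple failed, msgId=" + msgId);
            delegate.fail(msgId);
        }

        @Override public void ack(Object msgId) { delegate.ack(msgId); }
        @Override public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) { delegate.open(conf, context, collector); }
        @Override public void nextTuple() { delegate.nextTuple(); }
        @Override public void close() { delegate.close(); }
        @Override public void activate() { delegate.activate(); }
        @Override public void deactivate() { delegate.deactivate(); }
        @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { delegate.declareOutputFields(declarer); }
        @Override public Map<String, Object> getComponentConfiguration() { return delegate.getComponentConfiguration(); }
    }

We would register it in place of the KafkaSpout when building the topology. 
Would that be enough, or is there something built in that we’re missing?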

I’m surprised there is no straightforward way of knowing what is causing these 
failures.


From: Stig Rohde Døssing <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, October 16, 2017 at 9:09 PM
To: "[email protected]" <[email protected]>
Subject: Re: Topology high number of failures

Thanks for elaborating. Since you say that this has started recently you might 
want to check if any recent code changes have added paths to your bolts where 
tuples are not acked. You should check this for the bolts that receive tuples 
that are still anchored to the tuple tree, i.e. all bolts up until the point 
where you emit without anchors.
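To make the kind of path I mean concrete, here is a made-up example (the class 
and the isValid/transform helpers are placeholders, not your code) where one 
branch never acks the anchored input:

    import java.util.Map;
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class LeakyBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            if (!isValid(input)) {
                // No collector.ack(input) on this path: the input stays pending
                // until it hits the tuple timeout and is counted as a failure.
                return;
            }
            collector.emit(new Values(transform(input))); // unanchored emit
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("value"));
        }

        private boolean isValid(Tuple t) { return t.size() > 0; }    // placeholder check
        private Object transform(Tuple t) { return t.getValue(0); }  // placeholder transform
    }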

I'm curious why you have acking enabled if you don't anchor your tuples?
Regarding the log, I was mistaken. It doesn't look like there's much logging 
happening when a tuple fails, beyond the line I linked earlier. If your 
components throw exceptions they should go in the worker (topology) log, at 
least in recent versions of Storm, I'm not certain about 0.9.4.
The most likely reason for these failures is hitting the tuple timeout. Your 
complete latency is low, but I believe that number doesn't account for tuples 
that fail due to timeout (again, not certain about this). If some part of your 
code is receiving anchored tuples and not acking them, it would probably look 
exactly like what you're seeing.
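If that turns out to be the case, one way to surface it right away instead of 
waiting for the timeout is to ack or fail the input explicitly on every path, 
e.g. an execute() along these lines (process() is a placeholder for your bolt's 
logic):

    @Override
    public void execute(Tuple input) {
        try {
            process(input);
            collector.ack(input);
        } catch (Exception e) {
            collector.reportError(e); // makes the exception visible in the worker log and the UI
            collector.fail(input);    // fails immediately instead of waiting for the timeout
        }
    }
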
Another potential explanation could be this issue 
https://issues.apache.org/jira/browse/STORM-1750. If one of your executors fails 
while the Zookeeper connection is unstable, the executor thread may die and not 
recover. I haven't verified if this issue occurs on 0.9.4, but it looks like it 
might from looking at the executor code.

2017-10-16 11:23 GMT+02:00 Yovav Waichman <[email protected]>:
Thanks for replying.
Sorry for not being clear: acking is enabled, but we don’t use anchoring when 
emitting events from our bolts.

I’ve changed our log level to DEBUG.
Are there any error messages I should look for in particular?

Thanks again for your help

From: Stig Rohde Døssing <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, October 15, 2017 at 5:25 PM
To: "[email protected]" <[email protected]>
Subject: Re: Topology high number of failures

Hi,
Could you elaborate on your configuration? You say you aren't using anchoring, 
but if you're getting tuple failures (and a complete latency) then acking must 
be enabled. Do you mean that acking is enabled for the spout, but you don't use 
an anchor when emitting tuples from the bolts in the topology?
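In other words, is it something like this (placeholder names, just to pin down 
what I'm asking):

    // in the spout: emitting with a message id, so acks/fails are tracked
    collector.emit(new Values(payload), msgId);   // SpoutOutputCollector

    // in a bolt: emitting without an anchor, so the new tuple is not tied to the input
    collector.emit(new Values(result));           // OutputCollector, no anchor argument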

0.9.4 should be able to log when a tuple fails in the spout, here 
https://github.com/apache/storm/blob/v0.9.4/storm-core/src/clj/backtype/storm/daemon/executor.clj#L371.
 I believe you need to set the "backtype.storm.daemon.executor" logger level to 
DEBUG in the logback config.
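That is, something like the following logger entry in Storm's logback 
configuration (logback/cluster.xml in the 0.9.x distribution, if I remember the 
layout correctly):

    <configuration>
      <!-- keep the existing appenders and loggers as they are -->
      <logger name="backtype.storm.daemon.executor" level="DEBUG"/>
    </configuration>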

2017-10-15 11:10 GMT+02:00 Yovav Waichman <[email protected]>:
Hi,

We have been using Apache Storm for a couple of years, and everything was fine 
until now.
For our spout we are using “storm-kafka-0.9.4.jar”.

Lately, our number of “Failed” events has increased dramatically, and currently 
almost 20% of our total events are marked as Failed.

We tried investigating our topology logs, but we came up empty-handed.
Moreover, our spout complete latency is 25.996 ms.
We suspected that our db is under heavy load, so we increased our message 
timeout to 60 and even 300 seconds, but that had no effect on the number of 
failures.

Lowering our max spout pending value produced a negative result as well.
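For reference, we set those values roughly like this when building the topology 
(the numbers below are placeholders, not our exact production values):

    Config conf = new Config();
    conf.setMessageTimeoutSecs(300);  // raised from the default (30 seconds, I believe) to 60 and then 300
    conf.setMaxSpoutPending(1000);    // the value we experimented with lowering
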
At some point, since we are not using anchoring, we thought about adding 
anchoring, but we saw that the KafkaSpout handles failures by replaying them, 
so we were not sure whether to add it or not.

It would be helpful if you could point us to where in the Storm logs we can 
find the reason for these failures (an uncaught exception, maybe a timeout), 
since we are a bit blind at the moment.

We would appreciate any help with that.

Thanks in advance,
Yovav

