We had the same problem - failures, but no explanation.

Ignoring the root cause for a moment, one simple thing we did to determine 
where ISpout.fail() is getting called from was to create an Exception and 
log its stack trace.
E.g.
        Exception ex = new Exception("stacktrace for logging fail() invocation");
        log.info("failing msgId {}", msgId, ex);

Within the stack trace, we were able to see that all failures were due to 
timeouts. This should generally be the case if you are not getting any 
exceptions at the worker, and the stack trace is one way to prove exactly why 
a tuple is being failed.
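
A slightly fuller sketch of the same trick, written without any Storm 
dependency so it can be run on its own. FailTrace and its methods are 
made-up names for illustration; in a real spout the body of fail() below 
would sit inside your ISpout.fail(Object msgId) implementation and go 
through your own logger.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper, not part of Storm.
class FailTrace {
    /** Renders the current call stack as text. The Exception is created, never thrown. */
    static String capture(Object msgId) {
        Exception ex = new Exception("stacktrace for logging fail() invocation, msgId=" + msgId);
        StringWriter out = new StringWriter();
        ex.printStackTrace(new PrintWriter(out, true));
        return out.toString();
    }

    /** Stand-in for ISpout.fail(Object): records who called it. */
    static String fail(Object msgId) {
        String trace = capture(msgId);
        // A real spout would pass this to its logger instead of stderr.
        System.err.println("failing msgId " + msgId + "\n" + trace);
        return trace;
    }
}
```

The frames just above fail() in the printed trace show which code path 
triggered the failure, which is how we could tell timeouts apart from 
anything else.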

Obviously, as others pointed out, if you are dealing with FailedException in 
bolts, you can throw ReportedFailedException instead to get it reported back to 
the UI. But in the case of a timeout (or anything else that causes 
ISpout.fail() to be called), a stack trace can be a simple way to track how 
this happened.
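
For bolts that ack and fail by hand (the IRichBolt case), the pattern being 
discussed is to report the error before failing the tuple. A minimal sketch 
follows. Collector here is a hypothetical stand-in for the reportError, ack 
and fail methods of Storm's OutputCollector, and tuples are plain Objects so 
the example runs without Storm on the classpath. ReportingBolt and 
RecordingCollector are illustrative names too.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the OutputCollector methods used below.
interface Collector {
    void reportError(Throwable error);
    void ack(Object tuple);
    void fail(Object tuple);
}

// Illustrative bolt: every fail() is preceded by a reportError() naming the tuple.
class ReportingBolt {
    private final Collector collector;

    ReportingBolt(Collector collector) {
        this.collector = collector;
    }

    void execute(Object tuple) {
        try {
            process(tuple);
            collector.ack(tuple);
        } catch (RuntimeException e) {
            // Put the tuple in the message so the error can be tied to the failure.
            collector.reportError(new RuntimeException("failed tuple " + tuple, e));
            collector.fail(tuple);
        }
    }

    // Placeholder business logic that rejects one specific input.
    void process(Object tuple) {
        if ("bad".equals(tuple)) {
            throw new IllegalStateException("cannot process");
        }
    }
}

// Test double that records calls in order.
class RecordingCollector implements Collector {
    final List<String> events = new ArrayList<>();

    public void reportError(Throwable error) { events.add("error:" + error.getMessage()); }
    public void ack(Object tuple) { events.add("ack:" + tuple); }
    public void fail(Object tuple) { events.add("fail:" + tuple); }
}
```

This only helps with explicit failures. A tuple that times out never reaches 
the catch block in any bolt, which is why the stack trace in the spout is 
still useful.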

It would be nice to be able to generically (not customized in the spout) trace 
from messageId to what state the tuple is in (waiting for ack, timed-out, etc) 
from within the spout, but I’m not sure the best way to do that.
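
One non-generic way to approximate this is a small tracker inside the spout 
that remembers when each messageId was emitted. This is only a sketch under 
assumptions: PendingTracker and its State names are invented here, the 
timeout passed in is assumed to match topology.message.timeout.secs, and 
"failed at or after the timeout" is treated as a heuristic for a timeout 
rather than proof of one.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper a spout could call from nextTuple(), ack() and fail().
class PendingTracker {
    enum State { PENDING, ACKED, FAILED_EARLY, TIMED_OUT_LIKELY, UNKNOWN }

    private final Map<Object, Long> emittedAt = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    PendingTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Call when the spout emits a tuple anchored to msgId. */
    void emitted(Object msgId, long nowMillis) {
        emittedAt.put(msgId, nowMillis);
    }

    /** Call from ack(); UNKNOWN means the id was never tracked or already resolved. */
    State acked(Object msgId) {
        return emittedAt.remove(msgId) == null ? State.UNKNOWN : State.ACKED;
    }

    /** Call from fail(); classifies the failure by how long the tuple was pending. */
    State failed(Object msgId, long nowMillis) {
        Long start = emittedAt.remove(msgId);
        if (start == null) {
            return State.UNKNOWN;
        }
        return (nowMillis - start >= timeoutMillis)
                ? State.TIMED_OUT_LIKELY   // pending for the whole timeout window
                : State.FAILED_EARLY;      // failed before the timeout could fire
    }

    /** Whether the id is still waiting for an ack or fail. */
    State stateOf(Object msgId) {
        return emittedAt.containsKey(msgId) ? State.PENDING : State.UNKNOWN;
    }
}
```

It does not say which bolt was holding the tuple, only whether the failure 
looks like a timeout or an explicit fail.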

Tyson


On Oct 20, 2014, at 8:36 AM, Simon Cooper 
<[email protected]<mailto:[email protected]>> wrote:

That’s exactly the problem – our IRichBolts are quite complex, and keep hold of 
multiple tuples waiting for other ‘trigger’ tuples before acking several at 
once. With many thousands of tuples flying around the topology, it’s very hard 
to debug issues when one tuple randomly fails – which bolt was holding it 
waiting for a trigger and didn’t ack it in time? Or, if the tuple was failed 
manually, which bolt failed it?

From: Itai Frenkel [mailto:[email protected]]
Sent: 14 October 2014 14:43
To: [email protected]<mailto:[email protected]>
Subject: Re: Finding out why a tuple failed

Simon - Take a look at BasicBoltExecutor#execute, which is an adaptor from 
IBasicBolt to IRichBolt. Every collector.fail() is accompanied by 
collector.reportError() if you rethrow the exception as ReportedFailedException.

Could you please check that this is the case in your bolts too?
In IRichBolt you would need to take care of that yourself.

________________________________
From: Simon Cooper 
<[email protected]<mailto:[email protected]>>
Sent: Tuesday, October 14, 2014 12:48 PM
To: [email protected]<mailto:[email protected]>
Subject: RE: Finding out why a tuple failed

We’re seeing random failures. No exceptions in the logs, just failed tuples at 
the spout with no other information. We think it’s timeouts, but there’s no 
information anywhere as to which bolts in the tuple tree didn’t ack or fail the 
event in time.

From: Itai Frenkel [mailto:[email protected]]
Sent: 14 October 2014 08:32
To: [email protected]<mailto:[email protected]>
Subject: Re: Finding out why a tuple failed

​Let's say you have 10000 tuples processed. And only one of them reported an 
error and that is the same tuple that failed. Then you look in Sigmund and see 
the error and you know for sure it relates to the failed tuple.

Now let's consider that out of 10000, half of them failed for different 
reasons. Looking in Sigmund will still give you errors, but you would not be 
able to pin any of them to a specific tuple id.


________________________________
From: Vladi Feigin <[email protected]<mailto:[email protected]>>
Sent: Monday, October 13, 2014 8:50 PM
To: [email protected]<mailto:[email protected]>
Subject: Re: Finding out why a tuple failed

@Itai
What do you mean by "other errors"? Are these the internal Storm errors, which 
are not reported in the nimbus?
If yes, are they reported in the logs?
Vladi


On Mon, Oct 13, 2014 at 5:05 PM, Itai Frenkel 
<[email protected]<mailto:[email protected]>> wrote:
Assuming each failure in the code is accompanied by collector.reportError(ex) 
(as BasicBolt does), you would see an exception in nimbus. If there are many 
other errors, then it may not be the exception you are looking for.

To get more fidelity you would need to send all errors to ELK stack (that's 
what we do) and filter by id.

Itai

________________________________
From: Simon Cooper 
<[email protected]<mailto:[email protected]>>
Sent: Monday, October 13, 2014 2:58 PM
To: [email protected]<mailto:[email protected]>
Subject: Finding out why a tuple failed

Is there any possible way, either through logging or programmatically, to find 
out why a tuple failed? If it timed out, which bolts it was waiting for acks 
from in the tuple tree, and if it was explicitly failed, which bolt failed it? 
I’m having a hell of a time trying to debug a complex topology that is not 
acking any of its tuples back at the spout ☹

Thanks,
SimonC

