That’s exactly the problem – our IRichBolts are quite complex, and keep hold of multiple tuples waiting for other ‘trigger’ tuples before acking several at once. With many thousands of tuples flying around the topology, it’s very hard to debug issues when one tuple randomly fails – which bolt was holding it waiting for a trigger and didn’t ack it in time? Or, if the tuple was failed manually, which bolt failed it?
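One way to narrow this down, sketched here under assumptions rather than as anything Storm provides: have each bolt record the tuples it is holding and when it started holding them, and periodically log any that have been held for too long. `PendingTupleTracker`, its method names and the bolt name `join-bolt` are all hypothetical, and the sketch assumes it is only called from the bolt's own executor thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Storm: remembers which tuples one bolt is still holding. */
class PendingTupleTracker {
    private final String boltName;
    // tuple id -> time (ms) at which this bolt started holding it, in insertion order
    private final Map<Object, Long> heldSince = new LinkedHashMap<>();

    PendingTupleTracker(String boltName) {
        this.boltName = boltName;
    }

    /** Call when the bolt stashes a tuple to wait for a trigger. */
    void hold(Object tupleId, long nowMs) {
        heldSince.putIfAbsent(tupleId, nowMs);
    }

    /** Call at the point where the bolt acks or fails the tuple. */
    void release(Object tupleId) {
        heldSince.remove(tupleId);
    }

    /** One line per tuple held for longer than maxAgeMs, naming the bolt that holds it. */
    List<String> overdue(long nowMs, long maxAgeMs) {
        List<String> report = new ArrayList<>();
        for (Map.Entry<Object, Long> e : heldSince.entrySet()) {
            long age = nowMs - e.getValue();
            if (age > maxAgeMs) {
                report.add(boltName + " has held tuple " + e.getKey() + " for " + age + " ms");
            }
        }
        return report;
    }
}
```

In a real bolt the report lines would go to the log alongside the tuple id, so that a failure seen at the spout can be matched to the bolt that was still holding the tuple.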

From: Itai Frenkel [mailto:[email protected]]
Sent: 14 October 2014 14:43
To: [email protected]
Subject: Re: Finding out why a tuple failed

Simon - Take a look at BasicBoltExecutor#execute, which is an adaptor from IBasicBolt to IRichBolt. Every collector.fail is accompanied by collector.reportError() if you rethrow the exception as ReportedFailedException. Could you please check that this is the case in your bolts too? In IRichBolt you would need to take care of that yourself.

________________________________
From: Simon Cooper <[email protected]>
Sent: Tuesday, October 14, 2014 12:48 PM
To: [email protected]
Subject: RE: Finding out why a tuple failed

We’re seeing random failures. No exceptions in the logs, just failed tuples at the spout with no other information. We think it’s timeouts, but there’s no information anywhere as to which bolts in the tuple tree didn’t ack or fail the event in time.

From: Itai Frenkel [mailto:[email protected]]
Sent: 14 October 2014 08:32
To: [email protected]
Subject: Re: Finding out why a tuple failed

Let's say you have 10000 tuples processed, and only one of them reported an error, and that is the same tuple that failed. Then you look in Sigmund and see the error, and you know for sure it relates to the failed tuple. Now consider that out of 10000, half of them failed for different reasons. Looking in sigmund will still give you errors, but you would not be able to pinpoint them to a specific tuple id.

________________________________
From: Vladi Feigin <[email protected]>
Sent: Monday, October 13, 2014 8:50 PM
To: [email protected]
Subject: Re: Finding out why a tuple failed

@Itai What do you mean by "other errors"? Are these the internal Storm errors, which are not reported in the nimbus? If yes, are they reported in the logs?

Vladi

On Mon, Oct 13, 2014 at 5:05 PM, Itai Frenkel <[email protected]> wrote:

Assuming each failure in the code is accompanied by collector.reportError(ex) (as BasicBolt does), you would see an exception in nimbus. If there are many other errors, then it may not be the exception you are looking for. To get more fidelity you would need to send all errors to an ELK stack (that's what we do) and filter by id.

Itai

________________________________
From: Simon Cooper <[email protected]>
Sent: Monday, October 13, 2014 2:58 PM
To: [email protected]
Subject: Finding out why a tuple failed

Is there any possible way, either through logging or programmatically, to find out why a tuple failed? If it timed out, which bolts was it waiting for acks from in the tuple tree, and if it was explicitly failed, which bolt failed it? I’m having a hell of a time trying to debug a complex topology that is not acking any of its tuples back at the spout ☹

Thanks,
SimonC
