Hi Tyson,

This approach of logging a stack trace from ISpout.fail() is only possible if the topology runs with acking enabled, isn't it?

Vladi
On Mon, Oct 20, 2014 at 7:35 PM, Tyson Norris <[email protected]> wrote:

> We had the same problem - failures, but no explanation.
>
> Ignoring the root cause for a moment, one thing we did to determine where
> ISpout.fail() is getting called from was to generate an Exception and
> print the stack trace. E.g.
>
>     Exception ex = new Exception("stacktrace for logging fail() invocation");
>     log.info("failing msgId {}", msgId, ex);
>
> Within the stack trace, we were able to see that all failures were due
> to timeouts. This should generally be the case if you are not getting any
> exceptions at the worker, and this is one way to prove exactly why the
> tuple is getting failed.
>
> Obviously, as others pointed out, if you are dealing with FailedException
> in bolts, you can throw ReportedFailedException instead to get it reported
> back to the UI, but in the case of a timeout (or any other case that causes
> ISpout.fail() to be called) a stack trace can be a simple way to track how
> this happened.
>
> It would be nice to be able to generically (not customized in the spout)
> trace from messageId to what state the tuple is in (waiting for ack,
> timed out, etc.) from within the spout, but I’m not sure of the best way
> to do that.
>
> Tyson
>
> On Oct 20, 2014, at 8:36 AM, Simon Cooper <[email protected]> wrote:
>
> That’s exactly the problem – our IRichBolts are quite complex, and keep
> hold of multiple tuples waiting for other ‘trigger’ tuples before acking
> several at once. With many thousands of tuples flying around the topology,
> it’s very hard to debug issues when one tuple randomly fails – which bolt
> was holding it waiting for a trigger and didn’t ack it in time? Or, if the
> tuple was failed manually, which bolt failed it?
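For anyone reading this from the archive, Tyson's stack-trace idea above can be sketched roughly as follows. This is a minimal sketch that deliberately does not depend on Storm: FailTrace and describeFail are hypothetical names, and fail(Object msgId) only stands in for the ISpout.fail(Object) callback. A real spout would put the same logic in its own fail() and pass the exception to its own logger. Storm only calls fail() for tuples the spout emitted with a message id, so this only helps when acking is in use.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical stand-in for a spout; not part of Storm.
public class FailTrace {

    // Builds the text that would be logged: the msgId plus the call stack
    // at the point where fail() was entered.
    public static String describeFail(Object msgId) {
        // The exception is never thrown; it is only used to capture the stack.
        Exception ex = new Exception("stacktrace for logging fail() invocation");
        StringWriter sw = new StringWriter();
        ex.printStackTrace(new PrintWriter(sw, true));
        return "failing msgId " + msgId + "\n" + sw;
    }

    // Stand-in for ISpout.fail(Object msgId).
    public void fail(Object msgId) {
        // With a real logger this would be: log.info("failing msgId {}", msgId, ex);
        System.out.println(describeFail(msgId));
    }

    public static void main(String[] args) {
        new FailTrace().fail("msg-1");
    }
}
```

In a running topology the frames above fail() show which part of the framework invoked it, which is how the timeout case was identified in the message above.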
>
> *From:* Itai Frenkel [mailto:[email protected]]
> *Sent:* 14 October 2014 14:43
> *To:* [email protected]
> *Subject:* Re: Finding out why a tuple failed
>
> Simon - Take a look at BasicBoltExecutor#execute, which is an adaptor
> from IBasicBolt to IRichBolt. Every collector.fail is accompanied by
> collector.reportError() if you rethrow the exception as
> ReportedFailedException.
>
> Could you please check that this is the case in your bolts too?
> In IRichBolt you would need to take care of that yourself.
>
> ------------------------------
> *From:* Simon Cooper <[email protected]>
> *Sent:* Tuesday, October 14, 2014 12:48 PM
> *To:* [email protected]
> *Subject:* RE: Finding out why a tuple failed
>
> We’re seeing random failures. No exceptions in the logs, just failed
> tuples at the spout with no other information. We think it’s timeouts, but
> there’s no information anywhere as to which bolts in the tuple tree didn’t
> ack or fail the event in time.
>
> *From:* Itai Frenkel [mailto:[email protected]]
> *Sent:* 14 October 2014 08:32
> *To:* [email protected]
> *Subject:* Re: Finding out why a tuple failed
>
> Let's say you have 10000 tuples processed, and only one of them
> reported an error, and that is the same tuple that failed. Then you look in
> Sigmund and see the error and you know for sure it relates to the failed
> tuple.
>
> Now let's consider that out of 10000, half of them failed for different
> reasons. Looking in Sigmund will still give you errors, but you
> would not be able to pin them to a specific tuple id.
>
> ------------------------------
> *From:* Vladi Feigin <[email protected]>
> *Sent:* Monday, October 13, 2014 8:50 PM
> *To:* [email protected]
> *Subject:* Re: Finding out why a tuple failed
>
> @Itai
> What do you mean by "other errors"? Are these the internal Storm errors,
> which are not reported in nimbus?
> If yes, are they reported in the logs?
> Vladi
>
> On Mon, Oct 13, 2014 at 5:05 PM, Itai Frenkel <[email protected]> wrote:
>
> Assuming each failure in the code is accompanied by
> collector.reportError(ex) (aka BasicBolt), you would see an exception
> in nimbus. If there are many other errors, then it may not be the exception
> you are looking for.
>
> To get more fidelity you would need to send all errors to an ELK stack
> (that's what we do) and filter by id.
>
> Itai
>
> ------------------------------
> *From:* Simon Cooper <[email protected]>
> *Sent:* Monday, October 13, 2014 2:58 PM
> *To:* [email protected]
> *Subject:* Finding out why a tuple failed
>
> Is there *any* possible way, either through logging or
> programmatically, to find out why a tuple failed? If it timed out, which
> bolts it was waiting for acks from in the tuple tree, and if it was
> explicitly failed, which bolt failed it? I’m having a hell of a time trying
> to debug a complex topology that is not acking any of its tuples back at
> the spout :(
>
> Thanks,
> SimonC
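The thread does not settle on a built-in answer to "which bolt was holding the tuple". One possible workaround, sketched here purely as an assumption and not as anything Storm provides, is for each bolt that holds tuples to keep its own record of what it has received but not yet acked or failed, and to log entries that are older than the message timeout. PendingTupleTracker and all of its methods are hypothetical names; the caller supplies the tuple id and the clock value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper a bolt could own; not part of Storm.
public class PendingTupleTracker {
    private final String boltName;
    // tuple id -> time (ms) at which this bolt received the tuple
    private final Map<Object, Long> receivedAtMillis = new ConcurrentHashMap<>();

    public PendingTupleTracker(String boltName) {
        this.boltName = boltName;
    }

    // Call when the bolt starts holding a tuple.
    public void received(Object tupleId, long nowMillis) {
        receivedAtMillis.put(tupleId, nowMillis);
    }

    // Call when the bolt acks or fails the tuple.
    public void done(Object tupleId) {
        receivedAtMillis.remove(tupleId);
    }

    // Returns one description per tuple held longer than maxAgeMillis.
    public List<String> overdue(long nowMillis, long maxAgeMillis) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Object, Long> e : receivedAtMillis.entrySet()) {
            long age = nowMillis - e.getValue();
            if (age > maxAgeMillis) {
                out.add(boltName + " still holds " + e.getKey() + " after " + age + " ms");
            }
        }
        return out;
    }
}
```

A bolt could call overdue() periodically and log the result with the same id the spout logs in fail(), so that a timed-out msgId can be matched to the bolt that was still waiting on it. Choosing an id that both sides can see is left open here, since the thread does not specify one.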
