Hi Tyson,
This approach of logging a stack trace from ISpout.fail() is possible only
if the topology runs with acking enabled, isn't it?
Vladi

On Mon, Oct 20, 2014 at 7:35 PM, Tyson Norris <[email protected]> wrote:

>  We had the same problem - failures, but no explanation.
>
>  Ignoring the root cause for a moment, one thing we did to determine where
> ISpout.fail() is getting called from was simply to generate an Exception
> and print the stack trace.
> E.g.
>         Exception ex = new Exception("stacktrace for logging fail() invocation");
>         log.info("failing msgId {}", msgId, ex);
>
>  Within the stack trace, we were able to see that all failures were due
> to timeouts. This should generally be the case if you are not getting any
> exceptions at the worker, and this is one way to prove exactly why a tuple
> is getting failed.
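
For anyone who wants to try the idea outside a topology first, here is a minimal sketch in plain Java. It does not use Storm at all: the class TracingSpoutSketch and the methods stackTraceOf() and expireOnTimeout() are hypothetical names made up for illustration, and fail() here only mirrors the shape of ISpout.fail(Object msgId). The exception is created but never thrown, so it only serves to capture the call stack at the point fail() is entered.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical stand-in for a spout; it does not depend on Storm. */
class TracingSpoutSketch {

    /** Renders the stack trace of a throwable into a string for logging. */
    static String stackTraceOf(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw, true));
        return sw.toString();
    }

    /** Mirrors the shape of ISpout.fail(Object msgId) and records who called it. */
    static String fail(Object msgId) {
        // The exception is never thrown; it only captures the current call stack.
        Exception ex = new Exception("stacktrace for logging fail() invocation");
        String report = "failing msgId " + msgId + "\n" + stackTraceOf(ex);
        System.out.println(report);
        return report;
    }

    /** Hypothetical caller, standing in for whatever code path triggers the failure. */
    static String expireOnTimeout(Object msgId) {
        return fail(msgId);
    }

    public static void main(String[] args) {
        expireOnTimeout("msg-42");
    }
}
```

The printed trace lists the frames that led into fail(), so the calling method (expireOnTimeout in this sketch) shows up by name. In a real spout the same two lines would go at the top of fail(), using whatever logger the spout already has.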
>
>  Obviously, as others pointed out, if you are dealing with FailedException
> in bolts, you can throw ReportedFailedException instead to get it reported
> back to the UI, but in the case of a timeout (or any other case that causes
> ISpout.fail() to be called) a stack trace can be a simple way to track how
> this happened.
>
>  It would be nice to be able to generically (not customized in the spout)
> trace from messageId to what state the tuple is in (waiting for ack,
> timed-out, etc) from within the spout, but I’m not sure the best way to do
> that.
>
>  Tyson
>
>
>  On Oct 20, 2014, at 8:36 AM, Simon Cooper <
> [email protected]> wrote:
>
>   That’s exactly the problem – our IRichBolts are quite complex, and keep
> hold of multiple tuples waiting for other ‘trigger’ tuples before acking
> several at once. With many thousands of tuples flying around the topology,
> it’s very hard to debug issues when one tuple randomly fails – which bolt
> was holding it waiting for a trigger and didn’t ack it in time? Or, if the
> tuple was failed manually, which bolt failed it?
>
>   *From:* Itai Frenkel [mailto:[email protected] <[email protected]>]
> *Sent:* 14 October 2014 14:43
> *To:* [email protected]
> *Subject:* Re: Finding out why a tuple failed
>
>  Simon - Take a look at BasicBoltExecutor#execute, which is an adaptor
> from IBasicBolt to IRichBolt. Every collector.fail() there is accompanied by
> collector.reportError() if you rethrow the exception as ReportedFailedException.
>
>  Could you please check that this is the case in your bolts too?
>  In IRichBolt you would need to take care of that yourself.
>
>   ------------------------------
>   *From:* Simon Cooper <[email protected]>
> *Sent:* Tuesday, October 14, 2014 12:48 PM
> *To:* [email protected]
> *Subject:* RE: Finding out why a tuple failed
>
>   We’re seeing random failures. No exceptions in the logs, just failed
> tuples at the spout with no other information. We think it’s timeouts, but
> there’s no information anywhere as to which bolts in the tuple tree didn’t
> ack or fail the event in time.
>
>   *From:* Itai Frenkel [mailto:[email protected] <[email protected]>]
> *Sent:* 14 October 2014 08:32
> *To:* [email protected]
> *Subject:* Re: Finding out why a tuple failed
>
>  Let's say you have 10000 tuples processed, only one of them
> reported an error, and that is the same tuple that failed. Then you look in
> Sigmund and see the error and you know for sure it relates to the failed
> tuple.
>
>  Now let's consider that out of 10000, half of them failed for different
> reasons. Looking in Sigmund will still give you errors, but you
> would not be able to pin them to a specific tuple ID.
>
>
>   ------------------------------
>   *From:* Vladi Feigin <[email protected]>
> *Sent:* Monday, October 13, 2014 8:50 PM
> *To:* [email protected]
> *Subject:* Re: Finding out why a tuple failed
>
>    @Itai
>  What do you mean by "other errors"? Are these the internal Storm errors,
> which are not reported in the nimbus?
>   If yes, are they reported in the logs?
>   Vladi
>
>
>  On Mon, Oct 13, 2014 at 5:05 PM, Itai Frenkel <[email protected]> wrote:
>  Assuming each failure in the code is accompanied by
> collector.reportError(ex) (aka BasicBolt) then you would see an exception
> in nimbus. If there are many other errors, then it may not be the exception
> you are looking for.
>
>  To get more fidelity you would need to send all errors to ELK stack
> (that's what we do) and filter by id.
>
>  Itai
>
>   ------------------------------
>   *From:* Simon Cooper <[email protected]>
> *Sent:* Monday, October 13, 2014 2:58 PM
> *To:* [email protected]
> *Subject:* Finding out why a tuple failed
>
>   Is there *any* possible way, either through logging or
> programmatically, to find out why a tuple failed? If it timed out, which
> bolts it was waiting for acks from in the tuple tree, and if it was
> explicitly failed, which bolt failed it? I’m having a hell of a time trying
> to debug a complex topology that is not acking any of its tuples back at
> the spout :(
>
>  Thanks,
>  SimonC
>
>
>
>
