My guess is that the issue lies in marking a tuple tree as complete. If that does not happen within "topology.message.timeout.secs", the tuple is cleared from the acker's rotating map and marked as a failure. The bigger the tuple tree, the longer the tuple has to wait to be marked complete - until every bolt in the tree has acked it. By simplifying my topology, my tuple tree became trivial (S -> B1): as soon as B1 acks the tuple, it is marked as complete.
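For readers following along: Storm's acker tracks tuple-tree completion with an XOR trick - it keeps one number per spout tuple, XORs in the id of every emitted edge and XORs it back out on every ack, and the tree is complete exactly when the value returns to zero. Below is a minimal self-contained sketch of that bookkeeping in plain Java; the class and method names are illustrative, not Storm's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the acker's XOR bookkeeping (not Storm's real classes).
// For each spout tuple the acker keeps one long: the XOR of the ids of every
// tuple edge that has been emitted but not yet acked. Each emit XORs an edge
// id in; each ack XORs it back out. When the value returns to 0, the whole
// tuple tree is complete and the spout's ack() would be called.
public class AckerSketch {
    private final Map<String, Long> pending = new HashMap<>();

    // Called when the spout emits the root tuple, or when a bolt emits a
    // new tuple anchored to rootId.
    public void emitted(String rootId, long edgeId) {
        pending.merge(rootId, edgeId, (a, b) -> a ^ b);
    }

    // Called when a bolt acks the tuple for edgeId.
    // Returns true if the whole tree rooted at rootId is now complete.
    public boolean acked(String rootId, long edgeId) {
        long v = pending.merge(rootId, edgeId, (a, b) -> a ^ b);
        if (v == 0L) {
            pending.remove(rootId);
            return true; // the spout would receive ack() here
        }
        return false;
    }

    public static void main(String[] args) {
        AckerSketch acker = new AckerSketch();
        long rootEdge = 0x5DEECE66DL;

        // S emits the root tuple to B1 (tree S -> B1)...
        acker.emitted("tuple-1", rootEdge);
        // ...and B1 acks without emitting further: the tree is complete.
        System.out.println(acker.acked("tuple-1", rootEdge)); // prints "true"
    }
}
```

This is why the simplified S -> B1 tree completes so quickly: there is only a single edge to XOR out, so one ack zeroes the value.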
On Wed, Apr 16, 2014 at 11:02 PM, Michael Chang <[email protected]> wrote:

> Hi Srinath,
>
> Thanks for the update. I'm not quite sure why your change would remedy
> the problem (it seems like there are more tuples in flight in the system),
> but it's great that you could arrive at a working setup.
>
> Michael
>
> On Tue, Apr 15, 2014 at 11:33 PM, Srinath C <[email protected]> wrote:
>
>> Hi Michael,
>> I experimented a bit by making changes to my topology, and now I'm
>> seeing consistent acking with few failures on the spout.
>>
>> My topology had a spout S emitting tuples, and two BaseRichBolts, B1
>> (store tuple) and B2 (aggregate tuple), receiving tuples from the
>> default stream of the spout. I made S emit each tuple twice on different
>> streams: one with a message ID, for reliable delivery, streamed out to B1,
>> and another without a message ID, streamed out to B2.
>>
>> With this change there is a significant improvement in the number of
>> failed tuples. It's almost down to 1-2% of the total now, and even those
>> failures occurred only at peak tuple rates.
>>
>> I'd like to run more experiments to figure out what was wrong with my
>> earlier topology, but I'm time-constrained right now.
>> Hope it helps, and let me know if you figure out anything.
>>
>> Regards,
>> Srinath.
>>
>> On Tue, Apr 15, 2014 at 9:33 PM, Michael Chang <[email protected]> wrote:
>>
>>> Hey Srinath,
>>>
>>> Yep, our ackers don't seem overloaded at all, and the behavior you are
>>> seeing sounds exactly like what we are seeing here.
>>>
>>> On Tue, Apr 15, 2014 at 6:47 AM, Srinath C <[email protected]> wrote:
>>>
>>>> I have been seeing this behaviour on 0.9.0.1 running on AWS
>>>> (non-VPC). All tuples get a fail() on the spout and I'm not sure why.
>>>> Even a simple case of spoutA -> boltB shows this behaviour after a
>>>> continuous flow of tuples.
>>>>
>>>> So far, increasing the acker count hasn't helped. All I could figure
>>>> out is that fail() is called from backtype.storm.utils.RotatingMap#rotate,
>>>> which I believe means that topology.message.timeout.secs has been
>>>> exceeded and the tuple has not yet been marked as completed. I'm pretty
>>>> sure there are no exceptions in handling the tuples.
>>>>
>>>> Will update if I find any insights.
>>>>
>>>> On Tue, Apr 15, 2014 at 3:07 PM, 朱春来 <[email protected]> wrote:
>>>>
>>>>> Hi Michael Chang,
>>>>>
>>>>> Did you ack or fail the tuple in the bolt in a timely manner? Please
>>>>> also check the bolt's processing speed per tuple.
>>>>>
>>>>> *From:* Michael Chang [mailto:[email protected]]
>>>>> *Sent:* April 15, 2014, 16:41
>>>>> *To:* [email protected]
>>>>> *Subject:* Storm Topology Halts
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Issue:
>>>>>
>>>>> We are having issues with stuck topologies. When submitted and
>>>>> started, our topology will process for a while, then completely halt
>>>>> for around topology.message.timeout.secs seconds, after which it seems
>>>>> that all the in-flight tuples are failed. This cycle loops
>>>>> continuously. Has anybody seen this issue, or have suggestions about
>>>>> how to debug it?
>>>>>
>>>>> Environment:
>>>>>
>>>>> We are running a Storm cluster in AWS, non-VPC. We're running 0.9.1
>>>>> but using guava 16.0.1 and httpclient 4.3.1 in the lib path. We were
>>>>> originally trying this with the regular Netty transport, and reverting
>>>>> back to the ZMQ transport seemed to help at first, but now we're seeing
>>>>> the same behavior again, so it seems like a deeper-rooted problem than
>>>>> just the transport.
>>>>>
>>>>> Any help would be appreciated.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Michael
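Srinath's reading of backtype.storm.utils.RotatingMap#rotate matches how the acker expires pending tuples. A simplified, self-contained model in plain Java (illustrative names, not Storm's actual implementation): pending tuples sit in a short list of buckets, new entries go into the head bucket, and each rotation drops the tail bucket and runs an expire callback over whatever is left in it - in Storm, that callback is what ultimately leads to fail() on the spout. The acker rotates on a timer derived from topology.message.timeout.secs, so any tuple not acked within the timeout is guaranteed to be expired.

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.function.BiConsumer;

// Simplified model of a rotating expiry map (in the spirit of
// backtype.storm.utils.RotatingMap; not the real class).
public class RotatingMapSketch<K, V> {
    private final LinkedList<Map<K, V>> buckets = new LinkedList<>();
    private final BiConsumer<K, V> expireCallback;

    public RotatingMapSketch(int numBuckets, BiConsumer<K, V> expireCallback) {
        for (int i = 0; i < numBuckets; i++) buckets.add(new HashMap<>());
        this.expireCallback = expireCallback;
    }

    // New pending tuple: goes into the freshest bucket.
    public void put(K key, V value) {
        buckets.getFirst().put(key, value);
    }

    // Tuple tree completed in time: remove it from whichever bucket holds it.
    public V remove(K key) {
        for (Map<K, V> b : buckets) {
            if (b.containsKey(key)) return b.remove(key);
        }
        return null;
    }

    // Called on a timer; drops the oldest bucket and expires its entries.
    // In the acker, expiry is what eventually surfaces as fail() on the spout.
    public void rotate() {
        Map<K, V> expired = buckets.removeLast();
        buckets.addFirst(new HashMap<>());
        expired.forEach(expireCallback);
    }
}
```

With this model, a burst of fail() calls with no exceptions anywhere is exactly what you would expect when tuple trees simply take longer than the message timeout to complete - the entries age out of the tail bucket whether or not anything went wrong in a bolt.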
