I would push the incoming tuples into a stack, and have a background thread take the stack, process the top element, and discard the rest.

On Jul 5, 2016 10:33 AM, "Marco Nicolini" <[email protected]> wrote:
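A minimal sketch of that idea in plain Java (no Storm dependency; the class and method names here are my own invention, not part of any library): instead of a real stack, a single overwritable slot gives the same effect — the producer overwrites whatever is pending, and the background worker always takes the freshest value, so anything that arrived while it was busy is implicitly discarded.

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * A single-element "stack": offer() overwrites whatever is pending, so a
 * consumer that drains it only ever sees the freshest value. Older values
 * that were never taken are implicitly discarded.
 */
public class CoalescingSlot<T> {
    private final AtomicReference<T> latest = new AtomicReference<>();

    /** Producer side (e.g. the bolt's execute()): just overwrite the pending value. */
    public void offer(T value) {
        latest.set(value);
    }

    /** Consumer side (the background worker): atomically take and clear the slot. */
    public T take() {
        return latest.getAndSet(null);
    }

    public static void main(String[] args) throws InterruptedException {
        CoalescingSlot<String> slot = new CoalescingSlot<>();

        // Background worker: repeatedly process whatever is freshest.
        Thread worker = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                String tick = slot.take();
                if (tick != null) {
                    System.out.println("processing " + tick); // slow work would go here
                }
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        worker.setDaemon(true);
        worker.start();

        // Producer side: fast updates; intermediate ones may be skipped.
        for (int i = 0; i < 100; i++) {
            slot.offer("tick-" + i);
        }
        Thread.sleep(50);
        worker.interrupt();
    }
}
```

The key property is that `offer()` never blocks and never queues: memory use is bounded at one pending element, and the worker's latency only ever delays data, never backs it up.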
> Hello guys,
>
> I'm evaluating Storm for a project at work.
>
> I've just scratched the surface of Storm and I'd like to ask how you
> would tackle a typical issue that pops up in the domain I work in
> (financial real-time market data).
>
> I have some real-time data pushed into Storm (I have written a spout for
> that), and I need to group it on certain keys (fields grouping seems to
> take care of that nicely), do some processing on it, and write it to a
> queue somewhere (think JMS or other middleware here...).
>
> Thing is, the processing takes a long time (relative to the frequency of
> the input data) and I'm not interested in building a queue of data to
> process, as I would never be able to keep up with it. What I'm really
> after is a way to coalesce tuples at the bolt level so that the
> processing bolt works on the freshest possible data. In other words, I'm
> looking for a way to skip tuples that came in while my processor bolt
> was working.
>
> Note that I'm not concerned about parallelizing the processing, as
> processing data for the same group in parallel would break the ordering
> requirement I have on the processed outputs. At most I'm aiming at one
> "processor bolt" per group (we can assume the number of groups is known
> in advance).
>
> I think I could implement this behaviour in the spout: emit a new tuple
> for a given grouping only when the last emitted tuple for that grouping
> has been fully processed (i.e., ack is called on the spout). However,
> this complicates the spout code (it would have to group things before
> emitting them) and in general puts too much complexity on one component,
> imho.
>
> If a bolt could know when a tuple it emitted becomes fully processed
> (similar to what ack does for the spout), the problem would be simpler
> and the per-component complexity lower.
>
> Or the processor bolt could just process things asynchronously (i.e.,
> not in the thread where execute() is called on the bolt), always picking
> up the "latest" data from some state local to the bolt. This state would
> be overwritten every time a fresh tuple is received via execute().
>
> Or...
>
> Is there a configurable way to achieve this? Any other way you would
> tackle it?
>
> Sorry for the "wall of text". Any idea/help/advice is greatly
> appreciated.
>
> Regards,
>
> Marco
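To flesh out the per-group version you describe (one processor per group, ordering preserved within each group), one map of such "latest value" slots keyed by the grouping field is a plausible shape. This is just a sketch in plain Java; the names are invented for illustration and nothing here is Storm API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

/**
 * Keeps only the freshest pending value per group key. Older values that
 * were never taken are silently replaced, which is exactly the desired
 * coalescing behaviour: a slow processor skips stale updates.
 */
public class PerGroupCoalescer<K, V> {
    private final Map<K, AtomicReference<V>> slots = new ConcurrentHashMap<>();

    /** Producer side (e.g. execute()): overwrite the pending value for this group. */
    public void offer(K group, V value) {
        slots.computeIfAbsent(group, k -> new AtomicReference<>()).set(value);
    }

    /** Consumer side: atomically take-and-clear the freshest value, or null if none. */
    public V take(K group) {
        AtomicReference<V> ref = slots.get(group);
        return ref == null ? null : ref.getAndSet(null);
    }
}
```

If exactly one worker thread calls `take(group)` for any given group, per-group ordering of the processed outputs is preserved automatically, since only that thread ever processes that group's data.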
