Hello guys, I'm evaluating using storm for a project at work.
I've just scratched the surface of storm and I'd like to ask opinions on how you would tackle a typical issue that pops up in the domain I'm working in (financial real time market data). I have some real time data pushed into storm, (I have written a spout for that), and I need to group it on certain keys (field grouping seems to take care of that nicely) do some processing on it and write it to a queue somewhere (think JMS or other middle ware here...) Thing is... the processing takes a long time (in respect to the frequency of the input data) and I'm not really interested in building a queue of data to process... (as I would never be able to keep up with it). What really I'm after is a way to coalesce tuples at bolt level so that the processing bolt will work on the freshest possible data. In other words i'm looking for a way to skip tuples that came in while my processor bolt was working). Note that I'm not concerned about parallelizing the processing...as processing data for the same group in parallel would break the ordering requisite i have on processed outputs. At most I'm aiming at one "processor bolt per group" (we can assume the groups numerosity is known in advance). I think I can implement this behaviour in the spout: emit a new tuple (for a given grouping) only when the last emitted tuple (for a given grouping) has been fully processed (I.e. acknowledge is called on the spout). However this complicates a bit the spout code (as it would have to group things already before emitting them) and in general ends up putting too much complexity on one component... imho. If a bolt had the possibility to know when tuple emitted by itself becomes fully processed (similarly to what acknowledge does for the spout) the problem would be simpler and complexity-per component would be lower. Or the processor bolt could just process things asynchronously (i.e. not in the thread where execute() is called on the bolt) always picking up the "latest" data from some local state to the bolt. This state would be overwritten every time a new fresh tuple is received with execute. Or... Is there a 'configurable way' to achieve that? Any other way you would tackle this? Sorry for the "wall of text". Any idea/help/advice is greatly appreciated... Regards, Marco
