Hi Andy,

I think you can use Trident to persist the results at any point in your stream processing. I believe the way you do that is by using STREAM.persistentAggregate(...).
Here's an example from https://github.com/nathanmarz/storm/wiki/Trident-tutorial:

TridentTopology topology = new TridentTopology();
TridentState wordCounts = topology.newStream("spout1", spout)
    .each(new Fields("sentence"), new Split(), new Fields("word"))
    .groupBy(new Fields("word"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
    .parallelismHint(6);

In this case the counts (replace Count with whatever aggregation you are doing) are stored in an in-memory map, but you can write another class that saves this intermediate result to a database... at least that's my understanding; I am also still learning these things. I'm currently working on a similar problem and am attempting to store state into Cassandra. Feel free to watch my conversation threads (with Svend and Taylor Goetz).

-A

From: Aniket Alhat [mailto:[email protected]]
Sent: February-06-14 11:57 PM
To: [email protected]
Subject: Re: How to efficiently store the intermediate result of a bolt, and so it can be replayed after the crashes?

I hope this helps
https://github.com/pict2014/storm-redis

On Feb 7, 2014 12:07 AM, "Cheng-Kang Hsieh (Andy)" <[email protected]> wrote:

Sorry, I realized that question was badly written. Simply put, my question is: is there a recommended way to store the tuples emitted by a bolt so that the tuples can be replayed after a crash, without repeating the processing all the way up from the source spout? Any advice would be appreciated. Thank you!

Best,
Andy

On Tue, Feb 4, 2014 at 11:58 AM, Cheng-Kang Hsieh (Andy) <[email protected]> wrote:

Hi all,

First of all, thanks to Nathan and all the contributors for putting together such a great framework! I am learning a lot just from reading the discussion threads.

I am building a topology that contains one spout along with a chain of bolts (e.g. S -> A -> B, where S is the spout and A, B are bolts).
When S emits a tuple, the next bolt A buffers the tuple in a DFS, computes some aggregated values once it has received a sufficient amount of data, and then emits the aggregation results to the next bolt B.

Here comes my question: is there a recommended way to store the intermediate results emitted by a bolt, so that when a machine crashes the results can be replayed to the downstream bolt (i.e. bolt B)?

One possible solution: don't keep any intermediate results, and instead rely on Storm's ack framework, so that the raw data is replayed from spout S when a crash happens. However, this approach might not be appropriate in my case: it might take a long time (a couple of hours) before bolt A has received all the required data and emitted the aggregated results, so it would be very expensive for the ack framework to keep tracking that many tuples for that long.

An alternative solution: *make bolt A also a spout* and keep the emitted data in a DFS-backed queue. When a result has been acked, bolt A removes it from the queue.

I am wondering whether it is reasonable to make a task both a bolt and a spout at the same time, or whether there is a better approach. Thank you!

--
Cheng-Kang Hsieh
UCLA Computer Science PhD Student
M: (310) 990-4297
A: 3770 Keystone Ave. Apt 402, Los Angeles, CA 90034
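Following up on the suggestion above about swapping MemoryMapState for a database-backed state: Trident's pluggable state essentially boils down to a batched multiGet/multiPut contract. Here is a simplified, standalone sketch of that pattern — it does not use the real Storm interfaces, and the class and method names are illustrative, with a HashMap standing in for the database client you would actually use (Cassandra, Redis, etc.):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for Trident's backing-map contract: batched reads and
// writes keyed by the grouping fields. A real implementation would issue the
// same calls against a database client instead of the HashMap used here.
class DbBackedState {
    private final Map<String, Long> store;  // stands in for a DB client

    DbBackedState(Map<String, Long> store) {
        this.store = store;
    }

    // Batched read: one round trip per batch, not one per tuple.
    List<Long> multiGet(List<String> keys) {
        List<Long> values = new ArrayList<>();
        for (String key : keys) {
            values.add(store.get(key));  // null means "no state yet"
        }
        return values;
    }

    // Batched write: persists the aggregated values for the whole batch.
    void multiPut(List<String> keys, List<Long> values) {
        for (int i = 0; i < keys.size(); i++) {
            store.put(keys.get(i), values.get(i));
        }
    }

    // Roughly what persistentAggregate(..., new Count(), ...) does per batch:
    // read the current counts, add the batch's increments, write back.
    void applyCounts(List<String> keys, List<Long> increments) {
        List<Long> current = multiGet(keys);
        List<Long> updated = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            long base = current.get(i) == null ? 0L : current.get(i);
            updated.add(base + increments.get(i));
        }
        multiPut(keys, updated);
    }
}
```

One detail worth noting: because Trident retries failed batches, real state implementations also store the last-applied transaction id next to each value so that a replayed batch is applied idempotently; that bookkeeping is omitted here for brevity.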

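For the "make bolt A also a spout" idea at the end of the thread, the core piece is a durable queue with replay semantics: emitted results stay in the queue until acked, and anything still in flight after a restart is handed out again. Below is a minimal in-memory sketch of just those semantics — all names are illustrative, and a real version would write the pending state through to a DFS or a store like Redis so it survives a crash:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Replayable queue for intermediate results: next() hands out a message and
// marks it pending; ack() deletes it for good; fail() puts it back for
// redelivery; restorePending() models a restart, requeueing everything that
// was in flight when the crash happened.
class ReplayQueue {
    private long nextId = 0;
    private final Deque<Long> ready = new ArrayDeque<>();
    private final Map<Long, String> payloads = new HashMap<>();
    private final Set<Long> pending = new LinkedHashSet<>();  // emitted, not yet acked

    // Bolt A calls this with each aggregation result instead of emitting directly.
    long publish(String result) {
        long id = nextId++;
        payloads.put(id, result);
        ready.addLast(id);
        return id;
    }

    // The spout side: hand out the next result and remember it as in-flight.
    Map.Entry<Long, String> next() {
        Long id = ready.pollFirst();
        if (id == null) return null;
        pending.add(id);
        return Map.entry(id, payloads.get(id));
    }

    // Downstream processing succeeded: the result can be deleted for good.
    void ack(long id) {
        pending.remove(id);
        payloads.remove(id);
    }

    // Downstream processing failed: requeue for redelivery.
    void fail(long id) {
        if (pending.remove(id)) ready.addFirst(id);
    }

    // After a restart, everything that was in flight goes back on the queue.
    void restorePending() {
        for (Long id : pending) ready.addLast(id);
        pending.clear();
    }

    int size() { return ready.size() + pending.size(); }
}
```

This is also essentially what the linked storm-redis project and Kafka-backed spouts provide out of the box, which may be less work than building the bolt-plus-spout hybrid by hand.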