We use two Storm topologies, with Kafka in between: Kafka --> TopologyA --> Kafka --> TopologyB --> Final output
This allows the two halves of the computation to be scaled and maintained
independently.

--Tom

On Tue, Feb 11, 2014 at 2:36 PM, Cheng-Kang Hsieh (Andy) <[email protected]> wrote:
> Hi Aniket & Adrian,
>
> Thank you both so much for the kind replies! Although they don't directly
> solve my problem, following the code of storm-redis and Trident has been
> very rewarding.
>
> I guess storing the intermediate data in an external db (like Cassandra,
> as suggested by Adrian) would work, but what if the Bolt that is supposed
> to receive the intermediate data fails? In this case the emitter is also a
> Bolt and has no ACK mechanism to rely on, so the emitting Bolt might never
> know when it should resend the data to the receiving Bolt.
>
> In other frameworks like Samza or Spark Streaming, all emitted data,
> whether from a Spout or a Bolt, is treated the same way and so benefits
> from the same fault-tolerance mechanism (they are not as easy to use as
> Storm, though). For example, in Samza, all the output of a component is
> pushed to a Kafka queue with the receiving components as listeners (see
> here<http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html>).
>
> Conceptually, maybe a more general solution for Storm is to make a Bolt
> also a Spout that can receive ACKs from the receiving Bolts; however, that
> seems to violate Storm's assumptions?
>
> Again, I appreciate any advice or suggestions. Thank you!
>
> Best,
> Andy
>
>
> On Fri, Feb 7, 2014 at 9:37 AM, Adrian Mocanu <[email protected]> wrote:
>
>> Hi Andy,
>>
>> I think you can use Trident to persist the results at any point in your
>> stream processing.
>>
>> I believe the way you do that is by using STREAM.persistentAggregate(...)
>>
>> Here's an example from
>> https://github.com/nathanmarz/storm/wiki/Trident-tutorial
>>
>> TridentTopology topology = new TridentTopology();
>> TridentState wordCounts =
>>     topology.newStream("spout1", spout)
>>         .each(new Fields("sentence"), new Split(), new Fields("word"))
>>         .groupBy(new Fields("word"))
>>         .persistentAggregate(new MemoryMapState.Factory(), new Count(),
>>             new Fields("count"))
>>         .parallelismHint(6);
>>
>> In this case the counts (replace counts with whatever operations you are
>> doing) are stored in a memory map, but you can make another class that
>> saves this intermediate result to a db... at least that's my
>> understanding... I am currently also learning these things.
>>
>> I'm currently working on a similar problem and I'm attempting to store
>> into Cassandra. Feel free to watch my conversation threads (with Svend
>> and Taylor Goetz).
>>
>> -A
>>
>> *From:* Aniket Alhat [mailto:[email protected]]
>> *Sent:* February-06-14 11:57 PM
>> *To:* [email protected]
>> *Subject:* Re: How to efficiently store the intermediate result of a
>> bolt, and so it can be replayed after the crashes?
>>
>> I hope this helps
>> https://github.com/pict2014/storm-redis
>>
>> On Feb 7, 2014 12:07 AM, "Cheng-Kang Hsieh (Andy)" <[email protected]> wrote:
>>
>> Sorry, I realized that question was badly written. Simply put, my
>> question is: is there a recommended way to store the tuples emitted by a
>> Bolt so that the tuples can be replayed after a crash without repeating
>> the process all the way up from the source spout? Any advice would be
>> appreciated. Thank you!
>>
>> Best,
>> Andy
>>
>> On Tue, Feb 4, 2014 at 11:58 AM, Cheng-Kang Hsieh (Andy) <[email protected]> wrote:
>>
>> Hi all,
>>
>> First of all, thanks to Nathan and all the contributors for putting out
>> such a great framework! I am learning a lot, even just reading the
>> discussion threads.
>>
>> I am building a topology that contains one spout along with a chain of
>> bolts (e.g. S -> A -> B, where S is the spout and A, B are bolts).
>>
>> When S emits a tuple, the next bolt A will buffer the tuple in a DFS and
>> compute some aggregated values once it has received a sufficient amount
>> of data, then emit the aggregation results to the next bolt B.
>>
>> Here comes my question: is there a recommended way to store the
>> intermediate results emitted by a bolt, so that when a machine crashes,
>> the results can be replayed to the downstream bolts (i.e. bolt B)?
>>
>> One possible solution could be: don't keep any intermediate results, but
>> resort to Storm's ack framework, so that the raw data will be replayed
>> from spout S when a crash happens.
>>
>> However, this approach might not be appropriate in my case, as it might
>> take a pretty long time (like a couple of hours) before bolt A has
>> received all the required data and emits the aggregated results, so it
>> would be very expensive for the ack framework to keep tracking that many
>> tuples for that long.
>>
>> An alternative solution could be: *making bolt A also a spout* and
>> keeping the emitted data in a DFS queue. When a result has been acked,
>> bolt A removes it from the queue.
>>
>> I am wondering if it is reasonable to make a task both a bolt and a
>> spout at the same time? Or is there a better approach?
>>
>> Thank you!
>>
>> --
>> Cheng-Kang Hsieh
>> UCLA Computer Science PhD Student
>> M: (310) 990-4297
>> A: 3770 Keystone Ave. Apt 402,
>> Los Angeles, CA 90034
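Andy's alternative above (make bolt A also act like a spout, keep each emitted result in a queue, and delete it once the downstream bolt acks it) can be sketched in plain Java. This is a simplified illustration of the bookkeeping only, assuming a single downstream consumer; it is not Storm's actual spout/acker API, and the class names are hypothetical:

```java
import java.util.*;

// Hypothetical sketch of the "keep pending until acked" pattern:
// emitted tuples stay in a queue and are dropped only on ack, so
// anything unacked at crash time is available for replay.
class ReplayableEmitter {
    private final Map<Long, String> pending = new LinkedHashMap<>();
    private long nextId = 0;

    // Emit a tuple: record it as pending until the downstream acks it.
    long emit(String tuple) {
        long id = nextId++;
        pending.put(id, tuple);
        return id;
    }

    // Downstream processed the tuple successfully: forget it.
    void ack(long id) {
        pending.remove(id);
    }

    // Downstream failed or crashed: everything still pending is replayed.
    Collection<String> replay() {
        return new ArrayList<>(pending.values());
    }
}

public class ReplayDemo {
    public static void main(String[] args) {
        ReplayableEmitter emitter = new ReplayableEmitter();
        long first = emitter.emit("agg-result-1");
        emitter.emit("agg-result-2");
        emitter.ack(first);                    // first result reached bolt B
        System.out.println(emitter.replay());  // prints [agg-result-2]
    }
}
```

In a real deployment the `pending` map would live in the DFS queue Andy mentions, so it survives the emitting worker's crash.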

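For comparison, the two-topology layout at the top of the thread relies on the intermediate Kafka topic being a durable log: TopologyB can re-read from its own committed offset after a crash without re-running TopologyA. A toy sketch of that idea, with a hypothetical in-memory `DurableLog` class standing in for Kafka:

```java
import java.util.*;

// Hypothetical in-memory stand-in for a Kafka topic: an append-only log
// that consumers read from an offset they track themselves.
class DurableLog {
    private final List<String> entries = new ArrayList<>();

    void append(String record) { entries.add(record); }

    // Return every record from the given offset onward.
    List<String> readFrom(int offset) {
        return new ArrayList<>(entries.subList(offset, entries.size()));
    }
}

public class LogReplayDemo {
    public static void main(String[] args) {
        DurableLog intermediate = new DurableLog();
        // "TopologyA" publishes its aggregated results.
        intermediate.append("window-1-aggregate");
        intermediate.append("window-2-aggregate");

        // "TopologyB" had committed offset 1, then crashed and restarted.
        int committedOffset = 1;
        // On restart it resumes from the log, not from the original spout.
        System.out.println(intermediate.readFrom(committedOffset)); // prints [window-2-aggregate]
    }
}
```

This is why the two halves can be scaled and recovered independently: each half only needs the log segment after its own committed offset.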