I've seen a ton of Storm examples so far (I'm a noob), but what I don't
understand is how spouts achieve parallelism. Suppose I want to process a
giant file in Storm, and each source should read and process 64MB of the
input file. How can I get each spout to chew on only a portion of the input?
Alternatively, suppose I have the Twitter fire hose. When the fire hose
produces a tweet, does each of the 10 spouts get the same tweet? The way I
envision spouts right now is that X threads are simply created and each one
runs the same code. Does Storm do something special so that each spout
processes something different, even though they all run the same code?

 

Q1: How does each spout know which part of the giant input file to read? 
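(From what I can tell, one common pattern is for each spout task to ask Storm for its own task index and the total task count in `open()`, via `TopologyContext.getThisTaskIndex()` and `context.getComponentTasks(context.getThisComponentId()).size()`, and then claim a byte range of the file from those two numbers. A minimal sketch of just that arithmetic, with the Storm calls left as comments since the index/count values are assumptions here:)

```java
// Sketch: splitting a file into per-task byte ranges.
// In a real Storm spout, taskIndex would come from
// context.getThisTaskIndex() and numTasks from
// context.getComponentTasks(context.getThisComponentId()).size(),
// both available inside the spout's open() method.
public class SpoutPartition {
    // Returns {startOffset, endOffset} for this task's slice of the file.
    static long[] byteRange(long fileLength, int taskIndex, int numTasks) {
        long chunk = (fileLength + numTasks - 1) / numTasks; // ceiling division
        long start = Math.min(taskIndex * chunk, fileLength);
        long end = Math.min(start + chunk, fileLength);
        return new long[] { start, end };
    }

    public static void main(String[] args) {
        // A 200 MB file split across 4 spout tasks: each gets a 50 MB slice.
        long mb = 1024L * 1024L;
        for (int i = 0; i < 4; i++) {
            long[] r = byteRange(200 * mb, i, 4);
            System.out.println("task " + i + ": [" + r[0] + ", " + r[1] + ")");
        }
    }
}
```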

Q2: How would I initialize a spout to read a certain file? Input params from
the Storm command line?

Q3: How do I know when the input file is completely processed? In the final
bolts' emit logic, could they each report to one last bolt which piece of
the source they've processed, so that bolt checks off the done messages
and, once everything is accounted for, does - what? How can it signal the
topology owner that the job is done?
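(The check-off logic I have in mind would look something like the tracker below, which a final bolt could keep as a field and call from `execute()` whenever a "done" tuple arrives. This is purely my own sketch of the bookkeeping, not a Storm API; the piece IDs and expected count are assumed to come from the partitioning scheme:)

```java
// Sketch: a completion tracker a final bolt could keep (assumed design).
// Each upstream task emits one "done" tuple naming its piece; when every
// expected piece has been checked off, the tracker reports completion
// (at which point the bolt could log, write a marker file, etc.).
import java.util.HashSet;
import java.util.Set;

public class CompletionTracker {
    private final int expectedPieces;
    private final Set<Integer> donePieces = new HashSet<>();

    public CompletionTracker(int expectedPieces) {
        this.expectedPieces = expectedPieces;
    }

    // Called from the bolt's execute() for each "done" message.
    // Returns true once all pieces are accounted for; duplicate
    // "done" messages for the same piece are harmless (Set semantics).
    public boolean markDone(int pieceId) {
        donePieces.add(pieceId);
        return donePieces.size() == expectedPieces;
    }

    public static void main(String[] args) {
        CompletionTracker tracker = new CompletionTracker(3);
        System.out.println(tracker.markDone(0)); // false
        System.out.println(tracker.markDone(1)); // false
        System.out.println(tracker.markDone(1)); // duplicate: still false
        System.out.println(tracker.markDone(2)); // true - all pieces done
    }
}
```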

Q4: Is there an online forum that is easier to use than this email list
server, where I can ask and browse questions? This email list server is so
early-1990s it's shocking.

 

All the online Storm examples I've read have spouts that produce
essentially random information forever, which makes them near-useless to
me. Processing a giant file, or data from a live generator of real data,
would be much better. I hope I find some decent examples this weekend.

 

Thanks!

 
