Hi,

It sounds to me like you need an offline ETL process (MR/Shark) to get the processed data into the db. Storm fits use cases where you have a continuous data stream and need low-latency processing.

On 1 Dec 2014 04:26, "Stadin, Benjamin" <[email protected]> wrote:
> Hi all,
>
> I need some advice on whether Storm is the right tool for my purpose. My
> requirements share commonalities with „big data“, workflow coordination and
> „reactive“ event-driven data processing (as in, for example, Haskell Arrows),
> which doesn’t make it any easier to find the right tool set.
>
> To explain my needs it’s probably best to give an example scenario:
>
> - A user uploads small files (typically 1-200 files, file size
>   typically 2-10MB per file).
> - Files should be converted in parallel and on available nodes. The
>   conversion is actually done via native tools, so not much big data
>   processing is required, but dynamic parallelization is (for example,
>   splitting the conversion step into as many conversion tasks as there
>   are files). The conversion typically takes between several minutes and
>   a few hours.
> - The converted files are gathered and stored in a single database
>   (containing geometries for rendering).
> - Once the db is ready, a web map server is (re-)configured and the
>   user can make small updates to the data set via a web UI.
> - … Some other data processing steps which I leave out for brevity …
> - Initially there will be only a few concurrent users, but the system
>   shall be able to scale if needed.
>
> My current thoughts:
>
> - I should avoid uploading files into the distributed storage during
>   conversion, and should probably rather have each conversion filter
>   download the file it is actually converting from a shared place.
>   Otherwise it’s bad for scalability (too many redundant copies of the
>   same temporary files if there are many concurrent users and many
>   cluster nodes).
> - Apache Oozie seems an option to chain together my pipes into a
>   workflow. But is it a good fit with Storm?
> - Apache Crunch seems to make it easy to dynamically parallelize tasks
>   (Oozie itself can’t do this). But I may not need Crunch after all if I
>   have Storm, and it also doesn’t seem to fit my last problem, which
>   follows.
> - The part that causes me the most headache is the user-interactive db
>   update: I am considering using Kafka as a message bus to broker between
>   the web UI and a custom db handler (nb, the db is a SQLite file). Here I
>   think Storm would serve my purpose better than Spark (Streaming), since
>   it should have immediate update responsiveness and the handler is
>   probably best implemented as a long-running, continuous task. But does
>   Storm allow such long-running tasks to be created dynamically, so that
>   when another (web) user starts a new task a new long-running task is
>   created? Also, is it possible to identify a running task, so that a
>   long-running task can be bound to a session (db handler working on
>   local db updates, until the task is done)?
>
> ~Ben
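
To make the per-file fan-out in the scenario concrete (one conversion task per uploaded file, results gathered into a single SQLite db), here is a minimal single-machine sketch in Python. It is only an illustration under assumptions: `convert_file` is a hypothetical stand-in for the native conversion tool, and the `geometries` table and its columns are made up for the example. It is not a Storm topology and does not address distribution across nodes.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert_file(path):
    # Hypothetical placeholder for the native conversion tool;
    # a real version would likely invoke it via subprocess.run(...).
    return path, "converted:" + path


def convert_all(paths, db_path, max_workers=4):
    # Fan out: submit one conversion task per uploaded file.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(convert_file, p) for p in paths]
        results = [f.result() for f in as_completed(futures)]

    # Gather: a single writer stores all results in one SQLite db.
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS geometries "
            "(source TEXT PRIMARY KEY, data TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO geometries (source, data) VALUES (?, ?)",
            results,
        )
        conn.commit()
    finally:
        conn.close()
    return len(results)
```

The number of tasks follows the number of files, and only one writer touches the SQLite file, which sidesteps concurrent-write contention. In a cluster setting the same shape would apply, with the thread pool replaced by whatever job scheduler is chosen.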
