Hi
It sounds to me like you need an offline ETL process (MR/Shark) to get the
processed data into the db.
Storm fits use cases where you have a continuous data stream and need
low-latency processing.
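
As a rough illustration of the offline batch route for the conversion step in the quoted message, here is a minimal sketch in plain Python (no Storm or Hadoop involved). It fans out one task per uploaded file and then loads all results into a single SQLite file. The names `run_batch` and `convert`, the `geometries` table and its columns are assumptions made for illustration only; `convert` stands in for whatever wrapper calls the native conversion tool.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(paths, convert, db_path, max_workers=4):
    """Convert every file in parallel, then load all results into one SQLite db.

    `convert` is a hypothetical callable (path -> bytes), for example a
    wrapper that starts a native tool and returns its output.
    """
    results = {}
    # Fan out: one conversion task per uploaded file.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(convert, p): p for p in paths}
        for fut in as_completed(futures):
            # result() re-raises any exception raised inside convert.
            results[futures[fut]] = fut.result()

    # Gather: a single writer loads everything in one transaction.
    conn = sqlite3.connect(db_path)
    try:
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS geometries "
                "(source TEXT PRIMARY KEY, data BLOB)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO geometries (source, data) VALUES (?, ?)",
                sorted(results.items()),
            )
    finally:
        conn.close()
    return len(results)
```

Threads are enough here because the heavy work would happen in an external native process; spreading the tasks over several nodes would need a real job scheduler instead of a local pool.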
 On 1 Dec 2014 04:26, "Stadin, Benjamin" <
[email protected]> wrote:

> Hi all,
>
> I need some advice on whether Storm is the right tool for my purpose. My
> requirements share commonalities with „big data“, workflow coordination and
> „reactive“ event driven data processing (as in for example Haskell Arrows),
> which doesn’t make it any easier to find the right tool set.
>
> To explain my needs it’s probably best to give an example scenario:
>
>    - A user uploads small files (typically 1-200 files, file size
>    typically 2-10MB per file)
>    - Files should be converted in parallel and on available nodes. The
>    conversion is actually done via native tools, so there is not so much big
>    data processing required, but dynamic parallelization (so for example to
>    split the conversion step into as many conversion tasks as files are
>    available). The conversion typically takes between several minutes and a
>    few hours.
>    - The converted files are gathered and stored in a single database
>    (containing geometries for rendering)
>    - Once the db is ready, a web map server is (re-)configured and the
>    user can make small updates to the data set via a web UI.
>    - … Some other data processing steps which I leave out for brevity …
>    - There will be initially only a few concurrent users, but the system
>    shall be able to scale if needed
>
> My current thoughts:
>
>    - I should avoid uploading files into the distributed storage during
>    conversion, and should probably have each conversion filter download
>    the file it is actually converting from a shared place. Otherwise it
>    scales badly (too many redundant copies of the same temporary files
>    if there are many concurrent users and many cluster nodes).
>    - Apache Oozie seems an option to chain together my pipes into a
>    workflow. But is it a good fit with Storm?
>    - Apache Crunch seems to make it easy to dynamically parallelize tasks
>    (Oozie itself can’t do this). But I may not need Crunch after all if I have
>    Storm, and it also doesn’t seem to fit my last problem below.
>    - The part that causes me the most headache is the interactive db
>    update by the user: I am considering Kafka as a message bus to broker
>    between the web UI and a custom db handler (nb, the db is a SQLite file).
>    Here I think Storm would serve my purpose better than Spark (Streaming),
>    since updates should be applied immediately and the handler is probably
>    best implemented as a long-running, continuous task. But does Storm allow
>    creating such long-running tasks dynamically, so that a new long-running
>    task is created when another (web) user starts a new session? Also, is it
>    possible to identify a running task, so that a long-running task can be
>    bound to a session (db handler working on local db updates until the
>    task is done)?
>
>
> ~Ben
>
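
On the last point in the quoted message (a long-running db handler bound to a user session), the idea can be sketched independently of any framework. The following is a minimal plain-Python sketch, not Storm or Kafka API: one worker thread per session owns the writes to that session's SQLite file, and a registry looks workers up by session id. `SessionHandler`, `HandlerRegistry`, the `edits` table and the string payloads are all hypothetical names chosen for illustration; in a real setup the calls to `submit` would come from a message-bus consumer.

```python
import queue
import sqlite3
import threading


class SessionHandler(threading.Thread):
    """One long-running worker per session; it alone writes to that session's db."""

    def __init__(self, session_id, db_path):
        super().__init__(daemon=True)
        self.session_id = session_id
        self.db_path = db_path
        self.updates = queue.Queue()

    def run(self):
        # Open the connection inside the worker thread: by default a sqlite3
        # connection may only be used by the thread that created it.
        conn = sqlite3.connect(self.db_path)
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS edits "
                "(id INTEGER PRIMARY KEY, payload TEXT)"
            )
        while True:
            item = self.updates.get()
            if item is None:  # sentinel: the session is finished
                break
            with conn:  # commit each update as its own transaction
                conn.execute("INSERT INTO edits (payload) VALUES (?)", (item,))
        conn.close()


class HandlerRegistry:
    """Creates handlers on demand and finds running ones by session id."""

    def __init__(self):
        self._lock = threading.Lock()
        self._handlers = {}

    def submit(self, session_id, db_path, payload):
        with self._lock:
            handler = self._handlers.get(session_id)
            if handler is None:
                handler = SessionHandler(session_id, db_path)
                self._handlers[session_id] = handler
                handler.start()
        handler.updates.put(payload)

    def close(self, session_id):
        with self._lock:
            handler = self._handlers.pop(session_id, None)
        if handler is not None:
            handler.updates.put(None)
            handler.join()
```

Routing every update for a session through one queue keeps the writes to each SQLite file serialized, which is the property the session binding is meant to provide.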
