Re: Is Storm the right tool for me?

Michael Rose Mon, 01 Dec 2014 10:00:43 -0800

"The conversion typically takes between several minutes and a few hours."
The variability here doesn't lend itself well to Storm. Generally, your
work units need to be roughly equal in size; otherwise tuning and
reliability tracking will be difficult.
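
A rough illustration of why uneven work units are awkward: Storm tracks
tuple reliability against a single per-topology message timeout
(topology.message.timeout.secs, 30 seconds by default). The sketch below is
plain Python, not Storm code; the durations and the `timed_out` helper are
made up for illustration only.

```python
def timed_out(durations_secs, timeout_secs):
    """Return the durations that exceed one shared timeout."""
    return [d for d in durations_secs if d > timeout_secs]


# Hypothetical conversion times: 5 minutes, 20 minutes and 3 hours.
durations = [5 * 60, 20 * 60, 3 * 60 * 60]

# A timeout sized for the typical job flags the slow job as failed.
slow_jobs = timed_out(durations, timeout_secs=30 * 60)

# A timeout sized for the slowest job flags nothing, but a fast job that
# really did fail would only be noticed after the full three hours.
none_flagged = timed_out(durations, timeout_secs=max(durations))

if __name__ == "__main__":
    print(slow_jobs)
    print(none_flagged)
```

Either choice of timeout is a poor fit when job durations span minutes to
hours, which is the tuning difficulty described above.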


Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
[email protected]

On Mon, Dec 1, 2014 at 6:03 AM, Vladi Feigin <[email protected]> wrote:

> Sorry, I meant Spark.
>
> On Mon, Dec 1, 2014 at 11:38 AM, Stadin, Benjamin <
> [email protected]> wrote:
>
>> Thanks for your response.
>> Shark doesn’t seem to be something I want / need. The custom data handler
>> is performance critical, file based (SQLite file) and already highly
>> optimized (e.g. File sync is off, giving. And this db is associated with a
>> single user session and should not be replicated, but rather be a local
>> temporary source existing only on the executing node – otherwise
>> replicating these files will become a bottleneck. But maybe this is still
>> possible to configure with Shark?
>>
>>
>> From: Vladi Feigin <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Monday, 1 December 2014 06:16
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Is Storm the right tool for me?
>>
>> Hi
>> Sounds to me like you need an offline ETL process (MR/Shark) to get the
>> processed data into the db.
>> Storm fits use cases where you have a continuous data stream and need
>> low-latency processing.
>> On 1 Dec 2014 04:26, "Stadin, Benjamin" <
>> [email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I need some advice on whether Storm is the right tool for my purpose. My
>>> requirements share commonalities with "big data", workflow coordination and
>>> "reactive" event-driven data processing (as in, for example, Haskell Arrows),
>>> which doesn’t make it any easier to find the right tool set.
>>>
>>> To explain my needs it’s probably best to give an example scenario:
>>>
>>>    - A user uploads small files (typically 1-200 files, file size
>>>    typically 2-10MB per file)
>>>    - Files should be converted in parallel on available nodes. The
>>>    conversion is actually done via native tools, so not much big data
>>>    processing is required, but dynamic parallelization is (for example,
>>>    splitting the conversion step into as many conversion tasks as there
>>>    are files). The conversion typically takes between several minutes and
>>>    a few hours.
>>>    - The converted files are gathered and stored in a single database
>>>    (containing geometries for rendering)
>>>    - Once the db is ready, a web map server is (re-)configured and the
>>>    user can make small updates to the data set via a web UI.
>>>    - … Some other data processing steps which I leave out for brevity …
>>>    - There will be initially only a few concurrent users, but the
>>>    system shall be able to scale if needed
>>>
>>> My current thoughts:
>>>
>>>    - I should avoid uploading files into the distributed storage during
>>>    conversion, and should probably rather have each conversion filter
>>>    download the file it is actually converting from a shared place.
>>>    Otherwise it’s bad for scalability (too many redundant copies of the
>>>    same temporary files if there are many concurrent users and many
>>>    cluster nodes).
>>>    - Apache Oozie seems an option to chain together my pipes into a
>>>    workflow. But is it a good fit with Storm?
>>>    - Apache Crunch seems to make it easy to dynamically parallelize
>>>    tasks (Oozie itself can’t do this). But I may not need Crunch after all
>>>    if I have Storm, and it also doesn’t seem to fit my last problem, which
>>>    follows.
>>>    - The part that causes me the most headache is the interactive user
>>>    db update: I am considering using Kafka as a message bus to broker
>>>    between the web UI and a custom db handler (nb, the db is a SQLite
>>>    file). Here I think Storm would serve my purpose better than Spark
>>>    (Streaming), since it should have immediate update responsiveness and
>>>    the handler is probably best implemented as a long-running continuous
>>>    task. But does Storm allow such long-running tasks to be created
>>>    dynamically, so that when another (web) user starts a new task a new
>>>    long-running task is created? Also, is it possible to identify a
>>>    running task, so that a long-running task can be bound to a session
>>>    (db handler working on local db updates until the task is done)?
>>>
>>>
>>> ~Ben
>>>
>>
>
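
For reference, the "one conversion task per uploaded file, then gather the
results" step from the original question can be sketched without any
cluster framework. This is a minimal single-machine sketch using Python's
standard concurrent.futures module, not a Storm, Crunch or Oozie example.
`convert_file` and `convert_all` are hypothetical names, and the body of
`convert_file` is a placeholder for invoking the native conversion tool.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert_file(path):
    # Hypothetical stand-in for running the native conversion tool.
    return path + ".converted"


def convert_all(paths, max_workers=4):
    """Run one conversion task per file and gather results by input path."""
    results = {}
    if not paths:
        return results
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Fan out: the number of tasks equals the number of uploaded files.
        futures = {pool.submit(convert_file, p): p for p in paths}
        # Gather: collect each result as soon as its task finishes.
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results


if __name__ == "__main__":
    print(convert_all(["a.dat", "b.dat", "c.dat"]))
```

The number of tasks is decided at run time from the upload itself, which is
the "dynamic parallelization" the question asks for. Spreading those tasks
across nodes is a separate concern that this sketch does not address.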
