Sorry, I meant Spark.

On Mon, Dec 1, 2014 at 11:38 AM, Stadin, Benjamin <[email protected]> wrote:
> Thanks for your response.
>
> Shark doesn’t seem to be something I want or need. The custom data handler
> is performance critical, file based (SQLite file) and already highly
> optimized (e.g. file sync is off, giving. And this db is associated with a
> single user session and should not be replicated, but should rather be a
> local temporary source existing only on the executing node; otherwise
> replicating these files will become a bottleneck. But maybe this is still
> possible to configure with Shark?
>
>
> From: Vladi Feigin <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Monday, 1 December 2014 06:16
> To: "[email protected]" <[email protected]>
> Subject: Re: Is Storm the right tool for me?
>
> Hi
> Sounds to me like you need an offline ETL process (MR/Shark) to get the
> processed data into the db.
> Storm fits use cases where you have a continuous data stream and need
> low-latency processing.
> On 1 Dec 2014 04:26, "Stadin, Benjamin" <[email protected]> wrote:
>
>> Hi all,
>>
>> I need some advice on whether Storm is the right tool for my purpose. My
>> requirements share commonalities with "big data", workflow coordination
>> and "reactive" event-driven data processing (as in, for example, Haskell
>> Arrows), which doesn’t make it any easier to find the right tool set.
>>
>> To explain my needs it’s probably best to give an example scenario:
>>
>> - A user uploads small files (typically 1-200 files, file size
>> typically 2-10MB per file)
>> - Files should be converted in parallel and on available nodes. The
>> conversion is actually done via native tools, so not much big data
>> processing is required, but dynamic parallelization is (for example,
>> splitting the conversion step into as many conversion tasks as there
>> are files). The conversion typically takes between several minutes and
>> a few hours.
>> - The converted files are gathered and stored in a single database
>> (containing geometries for rendering)
>> - Once the db is ready, a web map server is (re-)configured and the
>> user can make small updates to the data set via a web UI.
>> - … Some other data processing steps which I leave out for brevity …
>> - There will initially be only a few concurrent users, but the system
>> should be able to scale if needed
>>
>> My current thoughts:
>>
>> - I should avoid uploading files into the distributed storage during
>> conversion, and should probably rather have each conversion filter
>> download the file it is actually converting from a shared place.
>> Otherwise it’s bad for scalability (too many redundant copies of the
>> same temporary files if there are many concurrent users and many
>> cluster nodes).
>> - Apache Oozie seems an option to chain together my pipes into a
>> workflow. But is it a good fit with Storm?
>> - Apache Crunch seems to make it easy to dynamically parallelize
>> tasks (Oozie itself can’t do this). But I may not need Crunch after all
>> if I have Storm, and it also doesn’t seem to fit my last problem, which
>> follows.
>> - The part that causes me the most headache is the user-interactive
>> db update: I am considering using Kafka as a message bus to broker
>> between the web UI and a custom db handler (nb, the db is a SQLite
>> file). Here I see that Storm would serve my purpose better than Spark
>> (Streaming), since it should have immediate update responsiveness and
>> the handler is probably best implemented as a long-running, continuing
>> task. But does Storm allow such long-running tasks to be created
>> dynamically, so that when another (web) user starts a new task, a new
>> long-running task is created? Also, is it possible to identify a
>> running task, so that a long-running task can be bound to a session
>> (db handler working on local db updates, until the task is done)?
>>
>>
>> ~Ben
>>
>
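For reference, the fan-out and gather steps from the quoted scenario (one conversion task per uploaded file, results collected into a single SQLite file) could be sketched on a single machine roughly as below. This is only an illustration under assumptions, not a Storm or Spark topology: `convert` is a hypothetical stand-in for the native conversion tool, the `geometries` table and its columns are made up, and mapping "file sync is off" to `PRAGMA synchronous = OFF` is a guess at what the original setup does.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def convert(path):
    """Hypothetical stand-in for the native conversion tool.

    A real version would probably shell out to the external converter
    (e.g. via subprocess.run) and return its output.
    """
    return path, f"geometry-of-{path}"


def convert_and_gather(paths, db_path, converter=convert, max_workers=4):
    # Fan out: one conversion task per uploaded file.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(converter, paths))

    # Gather: a single writer stores all results in one SQLite file.
    conn = sqlite3.connect(db_path)
    try:
        # Assumption: trades durability for speed, acceptable for a
        # temporary, node-local database.
        conn.execute("PRAGMA synchronous = OFF")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS geometries (source TEXT, data TEXT)"
        )
        with conn:  # commits on success, rolls back on error
            conn.executemany("INSERT INTO geometries VALUES (?, ?)", results)
        return conn.execute("SELECT COUNT(*) FROM geometries").fetchone()[0]
    finally:
        conn.close()
```

Calling `convert_and_gather(["a.shp", "b.shp"], "session.db")` would run both conversions concurrently and return the number of rows stored. The number of tasks follows the number of files, which is the "dynamic parallelization" asked about; distributing those tasks across nodes is the part a cluster framework would have to provide.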
