Hey Ben,

Have you had a look at Spark Streaming? It seems like a better choice for the "reactive" part of the application. In the last release of Crunch, I added a bunch of "SFunctions" that let you reuse logic you write against Spark's Java APIs with Crunch, if that makes sense for your use case.
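As a rough sketch of what I mean -- I'm writing this from memory, so double-check the exact wrap() overloads against the Javadoc linked below -- the idea is that a function written once against Spark's Java API can be dropped into a Crunch pipeline:

    import org.apache.crunch.MapFn;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.fn.SFunctions;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.spark.api.java.function.Function;

    public class SFunctionsSketch {
      // Logic written once against Spark's Java API...
      static final Function<String, Integer> LINE_LENGTH =
          new Function<String, Integer>() {
            @Override
            public Integer call(String line) {
              return line.length();
            }
          };

      // ...reused in a Crunch pipeline by wrapping it as a Crunch MapFn
      // (assumes an SFunctions.wrap(Function) -> MapFn overload; see the
      // Javadoc below for the exact signatures).
      static PCollection<Integer> lineLengths(PCollection<String> lines) {
        MapFn<String, Integer> fn = SFunctions.wrap(LINE_LENGTH);
        return lines.parallelDo(fn, Writables.ints());
      }
    }

The point is just that the same call() logic can run unchanged inside a Crunch pipeline instead of being rewritten as a DoFn.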
The SFunctions Javadoc is here:
http://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/fn/SFunctions.html

My suspicion, based on what I read in the above, is that you're more gated on CPU than on IO for most of the steps in your workflow -- is that true? If so, I'd be inclined to recommend an app architecture built on something like golang over the JVM-based Hadoop/Spark world.

Best,
Josh

On Mon, Dec 1, 2014 at 9:33 AM, Stadin, Benjamin <[email protected]> wrote:

> I have a mixed bag of requirements, ranging from parallel data processing
> to local file updates (single / same node), and "reactive" filter
> interaction. I'm undecided which frameworks I should settle on.
>
> It's probably best explained by an example usage scenario:
>
> - A web site user uploads small files (typically 1-200 files, file size
>   typically 2-10 MB per file).
> - Files should be converted in parallel on the available nodes. The
>   conversion is actually done via native tools, but I'm considering using
>   Crunch for dynamic parallelization of the conversion according to the
>   number of uploaded files. The conversion will likely take between
>   several minutes and a few hours.
> - The converted files are gathered and stored in a single *SQLite* (!)
>   database (containing geometries for rendering). This needs to be done
>   on one node only (file locking, etc.). You may say I should not use
>   SQLite, but believe me, I really do =).
> - Once the SQLite db is ready, a web map server is (re-)configured on the
>   very same server as the one where the db job was started, and the user
>   can interact with a web application and make small updates to the data
>   set via a web map editing UI. This is a temporary service. After a few
>   minutes, when user interaction is done, the server is "shut down" (it
>   isn't really; just the data source is removed from it and reconfigured).
> - When the user is done and hits the save button, the workflow triggers
>   another parallelizable job which does some post-processing on the data.
>
> The two main things causing me headaches:
>
> - I'm not sure how to implement "reactivity", as it's called in Haskell
>   Arrows, with my filters. How should I design a Crunch job as a
>   long-running job which accepts input and, in addition, runs only on a
>   single node? In Spark one could call coalesce(1, true), but in either
>   case I'm not sure how to cleanly implement a reactive filter in Crunch
>   or Spark.
> - Workflow management: In my scenario there are n user sessions, and each
>   can start different workflows in parallel (the above outlines just one
>   of the workflows). What should I use to chain my pipes into workflows?
>   Oozie? Crunch jobs? Could you point me to an example of how to do this?
>
> ~Ben
>

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
