Crunch, workflow management and user interaction

Stadin, Benjamin Mon, 01 Dec 2014 09:35:21 -0800

I have a mixed bag of requirements, ranging from parallel data processing to 
local file updates (single / same node), and "reactive" filter interaction. I’m 
undecided which frameworks I should settle on.


It’s probably best explained by an example usage scenario:

 *   A web site user uploads small files (typically 1-200 files, file size 
typically 2-10MB per file)
 *   Files should be converted in parallel and on available nodes. The 
conversion is actually done via native tools, but I am considering using Crunch 
for dynamic parallelization of the conversion according to the number of 
uploaded files. The conversion will likely take between several minutes and a 
few hours.
 *   The converted files are gathered and stored in a single *SQLite* (!) 
database (containing geometries for rendering). This needs to be done on one 
node only (file locking etc.). You may say I should not use SQLite, but believe 
me I really do =).
 *   Once the SQLite db is ready, a web map server is (re-)configured on the 
very same server as the one where the db job was started, and the user can 
interact with a web application and make small updates to the data set via a 
web map editing UI. This is a temporary service. After a few minutes, when user 
interaction is done, the server is "shut down" (it isn’t really; the data 
source is just removed from it and the server is reconfigured).
 *   When the user is done and hits the save button, the workflow triggers 
another parallelizable job which does some post-processing on the data.
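
The fan-out / single-writer shape of the steps above could be sketched roughly 
as follows. This is a local illustration in plain Python, not Crunch or Spark 
code: convert_file is a hypothetical stand-in for the native conversion tool, 
and the table name, column names and worker cap of 8 are assumptions made only 
for the example.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def convert_file(name: str) -> tuple[str, bytes]:
    """Hypothetical stand-in for the native conversion tool."""
    # A real version would invoke the external tool here and read its output.
    return name, f"geometry-for-{name}".encode()


def run_workflow(uploads: list[str], db_path: str = ":memory:") -> sqlite3.Connection:
    # Fan out: one conversion task per uploaded file, with the pool sized
    # from the number of uploads (capped at an assumed 8 workers).
    with ThreadPoolExecutor(max_workers=max(1, min(len(uploads), 8))) as pool:
        converted = list(pool.map(convert_file, uploads))

    # Gather: a single writer owns the SQLite file, so no two processes
    # ever compete for the file lock.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS geometries (name TEXT PRIMARY KEY, data BLOB)"
    )
    conn.executemany("INSERT INTO geometries (name, data) VALUES (?, ?)", converted)
    conn.commit()
    return conn
```

In a cluster setting the thread pool would be replaced by whatever the chosen 
framework uses to distribute work, while the gather step stays pinned to the 
one node that also hosts the map server.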

The two main things giving me a headache:

 *   I’m not sure how to implement "reactivity" as it’s called in Haskell 
Arrows with my filters. How should I design a Crunch job as a long-running job 
which accepts input, and in addition runs only on a single node? In Spark one 
could call coalesce(1, true), but in either case I’m not sure how to cleanly 
implement a reactive filter in Crunch or Spark.
 *   Workflow management: In my scenario, there are n user sessions and each 
can start different workflows in parallel (the above outlines just one of the 
workflows). What should I use to chain my pipes into workflows? Oozie? 
Crunch jobs? Could you point me to an example of how to do this?

~Ben
