I have a mixed bag of requirements, ranging from parallel data processing to local file updates (on a single node) and "reactive" filter interaction. I’m undecided which frameworks I should settle on.
It’s probably best explained by an example usage scenario:

* A web site user uploads small files (typically 1-200 files, 2-10 MB per file).
* The files should be converted in parallel on the available nodes. The conversion itself is done by native tools, but I am considering Crunch to parallelize it dynamically according to the number of uploaded files. The conversion will likely take between several minutes and a few hours.
* The converted files are gathered and stored in a single *SQLite* (!) database (containing geometries for rendering). This needs to be done on one node only (file locking etc.). You may say I should not use SQLite, but believe me, I really do =).
* Once the SQLite db is ready, a web map server is (re-)configured on the very same server where the db job was started, and the user can interact with a web application and make small updates to the data set via a web map editing UI. This is a temporary service. After a few minutes, when user interaction is done, the server is "shut down" (it isn’t really; the data source is just removed from it and the server is reconfigured).
* When the user is done and hits the save button, the workflow triggers another parallelizable job which does some post-processing on the data.

The two main things causing me headaches:

* I’m not sure how to implement "reactivity", as it’s called in Haskell Arrows, with my filters. How should I design a Crunch job as a long-running job which accepts input and, in addition, runs only on a single node? In Spark one could call coalesce(1, true), but in either case I’m not sure how to cleanly implement a reactive filter in Crunch or Spark.
* Workflow management: In my scenario there are n user sessions, and each can start different workflows in parallel (the above outlines just one of the workflows). What should I use to chain my pipes into workflows? Oozie? Crunch jobs? Could you point me to an example of how to do this?

~Ben
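
To make the shape of the scenario concrete, here is a rough, framework-free sketch of the "convert in parallel, then gather into one SQLite db on a single node" step. It is only an illustration under assumptions: it is not Crunch or Spark code, a local thread pool stands in for the cluster nodes, and `convert`, `build_database`, and the `geometries` table are hypothetical names, with `convert` faking the native conversion tool.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def convert(path):
    """Hypothetical stand-in for the native conversion tool."""
    # A real job would shell out to the converter; here we fake a geometry.
    return (path, f"POINT({len(path)} 0)")


def build_database(paths, db_path=":memory:", workers=4):
    # Fan out: convert every uploaded file in parallel
    # (threads stand in for cluster nodes in this sketch).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(convert, paths))

    # Fan in: a single writer gathers all results into one SQLite db,
    # so there is never more than one process holding the file lock.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE geometries (source TEXT PRIMARY KEY, wkt TEXT)")
    conn.executemany("INSERT INTO geometries VALUES (?, ?)", rows)
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = build_database(["a.shp", "b.shp", "c.shp"])
    print(conn.execute("SELECT COUNT(*) FROM geometries").fetchone()[0])
```

The point of the sketch is the barrier between the two phases: the parallel part can scale with the number of uploaded files, while the SQLite write happens in exactly one place, which is the part I don’t know how to express cleanly in Crunch or Spark.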
