Most of the use cases fall into two categories:

1. Pre-processing over TB/PB-scale data, where the data is larger than the total RAM available on the cluster. Given the maturity of MapReduce, a DAG-based job scheduling framework running on top of it (Scalding/Cascading or Scrunch/Crunch) gives you the power to write code at a higher level of abstraction, as Sean mentioned. Since you are shuffling intermediate results to disk anyway, I don't see much difference between MapReduce and Spark pipelines here.
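To make point 1 concrete, here is a rough sketch of the kind of disk-bound pre-processing job I mean, written against Scalding's fields API. The class name, paths and tokenization are made up for illustration; the point is just that the groupBy forces a shuffle that spills to disk whether the engine underneath is MapReduce or Spark.

```scala
import com.twitter.scalding._

// Hypothetical pre-processing job: tokenize raw text and count tokens.
// args("input") / args("output") are placeholder paths.
class TokenCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
    .groupBy('word) { _.size }          // the shuffle here goes to disk either way
    .write(Tsv(args("output")))
}
```

The equivalent Spark pipeline is a few lines of flatMap/reduceByKey/saveAsTextFile, but for a single pass over data that doesn't fit in RAM, both end up bound by the same disk shuffle.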
2. Running iterative algorithms over features: here the data has already been cleaned in step 1 and you are running algorithmic analysis, perhaps iterating to convergence of some sort. The MapReduce paradigm was not meant for such tasks, and the same argument applies to distributed graphs and streaming data. This is where Spark starts to shine, because you can take the DAG and mark parts of it, or the whole thing, to be cached in memory. Scalding/Scrunch could also expose an API for in-memory caching of parts of the DAG, but it is not available yet. (A small sketch of that caching pattern is at the bottom of this mail, below the quoted thread.)

To sum up, I think we will need both tools for different use cases until they are merged (?) by a higher abstraction layer (hopefully Scalding/Scrunch!).

On Sat, Feb 1, 2014 at 4:43 PM, Sean Owen <[email protected]> wrote:

> An M/R job is a one-shot job, in itself. Making it iterative is what a
> higher-level controller does, by running it several times and pointing
> it at the right input. That bit isn't part of M/R. So I don't think
> you would accomplish this goal by implementing something *under* the
> M/R API.
>
> M/Rs still get written but I think most people serious about it are
> already using higher-level APIs like Apache Crunch, or Cascading.
>
> For those who haven't seen it, Crunch's abstraction bears a lot of
> resemblance to the Spark model -- handles on remote collections. So,
> *the reverse* of this suggestion (i.e. Spark-ish API on M/R) is
> basically Crunch, or Scrunch if you like Scala.
>
> I know Josh Wills has put work into getting Crunch to operate *on top
> of Spark* even. That might be of interest to the original idea of
> getting a possibly more familiar API, for some current Hadoop devs,
> running on top of Spark. (Josh tells me it also enables a few tricks
> that are hard in Spark.)
>
> --
> Sean Owen | Director, Data Science | London
>
> On Sat, Feb 1, 2014 at 11:57 PM, nileshc <[email protected]> wrote:
> > This might seem like a silly question, so please bear with me. I'm not
> > sure about it myself, just would like to know if you think it's utterly
> > unfeasible or not, and if it's at all worth doing.
> >
> > Does anyone feel like it'll be a good idea to build some sort of a
> > library that allows us to write code for Spark using the usual bloated
> > Hadoop API? This is for the people who want to run their existing
> > MapReduce code (with NIL or minimal adjustments) with Spark to take
> > advantage of its speed and its better support for iterative workflows.
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/Hadoop-MapReduce-on-Spark-tp1110.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
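As promised above, a minimal, self-contained sketch of what I mean in point 2 by marking part of the DAG to be cached. Everything here (the file name, the toy gradient-descent update, the fixed iteration count) is made up for illustration; the point is only that the cached RDD is re-read from memory on every pass instead of being recomputed from disk.

```scala
import org.apache.spark.SparkContext

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "iterative-sketch")

    // Toy data set: each line is "x,y"; fit y ~ w * x by gradient descent.
    // "features.csv" is a placeholder path, not something from this thread.
    val points = sc.textFile("features.csv")
      .map { line => val Array(x, y) = line.split(","); (x.toDouble, y.toDouble) }
      .cache()                                  // this part of the DAG is kept in memory

    val n = points.count()                      // first action materialises and caches the RDD
    var w = 0.0
    for (_ <- 1 to 50) {                        // or iterate until the update falls below some epsilon
      val grad = points.map { case (x, y) => 2 * (w * x - y) * x }.reduce(_ + _) / n
      w -= 0.01 * grad                          // each pass re-reads RAM, not HDFS
    }
    println(s"fitted slope: $w")
    sc.stop()
  }
}
```

With plain M/R (or Scalding/Crunch today), each of those 50 passes would be a separate job re-reading its input from HDFS, which is exactly the gap I was describing.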
