An M/R job is, in itself, a one-shot job. Making it iterative is what a higher-level controller does, by running the job several times and pointing each run at the right input. That bit isn't part of M/R, so I don't think you would accomplish this goal by implementing something *under* the M/R API.
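Since the "iteration lives outside M/R" point may be easier to see in code, here is a minimal sketch of the kind of driver loop I mean. It uses the stock identity Mapper/Reducer just so it compiles on its own; the fixed loop count, the paths, and the missing convergence check are all placeholder assumptions, not a real algorithm.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);

    // The "higher-level controller" is just this loop: it submits a fresh,
    // one-shot M/R job each pass and rewires output -> input itself.
    for (int i = 0; i < 10; i++) {
      Path output = new Path(args[1] + "/iter-" + i);

      Job job = Job.getInstance(conf, "iteration-" + i);
      job.setJarByClass(IterativeDriver.class);
      job.setMapperClass(Mapper.class);    // identity map/reduce stand in for
      job.setReducerClass(Reducer.class);  // whatever the real iteration does
      job.setOutputKeyClass(LongWritable.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, output);

      if (!job.waitForCompletion(true)) {
        System.exit(1);  // each submission finishes or fails on its own
      }

      input = output;    // point the next job at this job's output
      // a real controller would also decide here whether to stop (e.g. via counters)
    }
  }
}

In Spark, by contrast, that loop is just ordinary driver code over a cached RDD, which is where the iterative-workflow advantage in the original question comes from.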
M/Rs still get written, but I think most people serious about it are already using higher-level APIs like Apache Crunch or Cascading. For those who haven't seen it, Crunch's abstraction bears a lot of resemblance to the Spark model -- handles on remote collections. So *the reverse* of this suggestion (i.e. a Spark-ish API on M/R) is basically Crunch, or Scrunch if you like Scala.

I know Josh Wills has put work into getting Crunch to operate *on top of Spark*, even. That might be of interest to the original idea of getting a possibly more familiar API, for some current Hadoop devs, running on top of Spark. (Josh tells me it also enables a few tricks that are hard in Spark.)

--
Sean Owen | Director, Data Science | London

On Sat, Feb 1, 2014 at 11:57 PM, nileshc <[email protected]> wrote:
> This might seem like a silly question, so please bear with me. I'm not sure
> about it myself, just would like to know if you think it's utterly
> unfeasible or not, and if it's at all worth doing.
>
> Does anyone feel like it'll be a good idea to build some sort of a library
> that allows us to write code for Spark using the usual bloated Hadoop API?
> This is for the people who want to run their existing MapReduce code (with
> NIL or minimal adjustments) with Spark to take advantage of its speed and
> its better support for iterative workflows.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Hadoop-MapReduce-on-Spark-tp1110.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
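To make the "handles on remote collections" comparison above concrete, here is a rough sketch of a word count in the Crunch style, written from memory of its getting-started example; treat the paths, class name, and exact method signatures as assumptions that may vary by version. The shape is the point: you transform a PCollection handle and let the pipeline plan the underlying M/R jobs.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
  public static void main(String[] args) {
    // The pipeline plans and runs M/R jobs behind the scenes.
    Pipeline pipeline = new MRPipeline(CrunchWordCount.class);

    // A PCollection is a handle on a distributed collection, much like an RDD.
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // Transformations compose lazily; count() just yields another handle.
    PTable<String, Long> counts = words.count();

    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();   // triggers planning and execution of the M/R job(s)
  }
}

That is close to a line-for-line match for how the same program looks over Spark's RDDs, which is why running Crunch itself on a Spark backend (the work Josh has been doing) is a fairly natural fit.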
