It’s fairly easy to take your existing Mapper and Reducer objects and call them within Spark. First, you can use SparkContext.hadoopRDD to read a file with any Hadoop InputFormat (you can even pass it the JobConf you would’ve created in Hadoop). Then use mapPartitions to iterate over each partition’s records and pass them to your Mapper, and reduceByKey or groupByKey to do the Reducer step.
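For concreteness, here is a minimal Scala sketch of that recipe (not from the original message). It assumes a hypothetical old-API WordCountMapper (Mapper[LongWritable, Text, Text, IntWritable]) whose reduce step is just a sum, plus placeholder HDFS paths; swap in your own Mapper, InputFormat, and reduce function.

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, OutputCollector, Reporter, TextInputFormat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD functions (reduceByKey) on older Spark versions
import scala.collection.mutable.ArrayBuffer

object MapReduceOnSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mr-on-spark"))

    // Reuse the JobConf you would have built for Hadoop.
    val jobConf = new JobConf()
    FileInputFormat.setInputPaths(jobConf, "hdfs:///input/path")  // placeholder path

    // Read with any old-API InputFormat via hadoopRDD.
    val input = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text])

    // Run the existing Mapper over each partition, buffering its output.
    // JobConf isn't java-serializable, so build one inside the closure
    // (or ship the driver's conf wrapped in a SerializableWritable).
    val mapped = input.mapPartitions { records =>
      val mapper = new WordCountMapper()  // hypothetical existing old-API Mapper
      mapper.configure(new JobConf())
      val out = new ArrayBuffer[(String, Int)]()
      val collector = new OutputCollector[Text, IntWritable] {
        // Copy values out of the Writables, since Hadoop reuses those objects.
        def collect(k: Text, v: IntWritable): Unit = out += ((k.toString, v.get))
      }
      records.foreach { case (k, v) => mapper.map(k, v, collector, Reporter.NULL) }
      mapper.close()
      out.iterator
    }

    // Stand in for the Reducer; here the reduce step is assumed to be a sum.
    mapped.reduceByKey(_ + _).saveAsTextFile("hdfs:///output/path")  // placeholder path
    sc.stop()
  }
}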
We’ve investigated offering the MapReduce API directly, and while it’s possible, one problem is that a lot of MapReduce code isn’t thread-safe. Hadoop runs each task in a separate JVM, while Spark can run multiple tasks concurrently in the same JVM, so some of the existing code in the jobs we tried porting this way broke. But if your code is thread-safe, the approach mentioned above should work pretty well (see the sketch after the quoted message below for the kind of code that breaks).

Matei

On Feb 1, 2014, at 3:57 PM, nileshc <[email protected]> wrote:

> This might seem like a silly question, so please bear with me. I'm not sure
> about it myself, just would like to know if you think it's utterly
> unfeasible or not, and if it's at all worth doing.
>
> Does anyone feel like it'll be a good idea to build some sort of a library
> that allows us to write code for Spark using the usual bloated Hadoop API?
> This is for the people who want to run their existing MapReduce code (with
> NIL or minimal adjustments) with Spark to take advantage of its speed and
> its better support for iterative workflows.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Hadoop-MapReduce-on-Spark-tp1110.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
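To make the thread-safety point concrete, here is a small illustrative sketch (the LogLineMapper name, its shared SimpleDateFormat, and the assumed yyyy-MM-dd prefix on each log line are all made up). With Hadoop's one-task-per-JVM model this runs fine; with several Spark tasks in one executor JVM, the shared formatter gets mutated concurrently and produces garbage.

import java.text.SimpleDateFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{MapReduceBase, Mapper, OutputCollector, Reporter}

object LogLineMapper {
  // Shared, mutable, and not thread-safe: harmless when each task has its
  // own JVM, but racy when Spark runs several tasks in the same JVM.
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
}

class LogLineMapper extends MapReduceBase
    with Mapper[LongWritable, Text, Text, Text] {
  override def map(key: LongWritable, value: Text,
                   out: OutputCollector[Text, Text], reporter: Reporter): Unit = {
    // Assumes each line starts with a yyyy-MM-dd timestamp (illustrative only).
    // Concurrent parse() calls on the shared SimpleDateFormat corrupt its state.
    val day = LogLineMapper.dateFormat.parse(value.toString.take(10))
    out.collect(new Text(LogLineMapper.dateFormat.format(day)), value)
  }
}

// A thread-safe version keeps the formatter per instance (or in a ThreadLocal);
// with the mapPartitions approach above, each task builds its own Mapper anyway.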
