It’s fairly easy to take your existing Mapper and Reducer objects and call them within Spark. First, you can use SparkContext.hadoopRDD to read a file with any Hadoop InputFormat (you can even pass it the JobConf you would’ve created in Hadoop). Then use mapPartitions to iterate over each partition’s records and pass them to your Mapper, and reduceByKey or groupByKey to do the Reducer step.
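For concreteness, here is a minimal Scala sketch of that recipe (not from the original message). It assumes a hypothetical old-API WordCountMapper (Mapper[LongWritable, Text, Text, IntWritable]) whose reduce step is just a sum, plus placeholder HDFS paths; swap in your own Mapper, InputFormat, and reduce function.

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, OutputCollector, Reporter, TextInputFormat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD functions (reduceByKey) on older Spark versions
import scala.collection.mutable.ArrayBuffer

object MapReduceOnSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mr-on-spark"))

    // Reuse the JobConf you would have built for Hadoop.
    val jobConf = new JobConf()
    FileInputFormat.setInputPaths(jobConf, "hdfs:///input/path")  // placeholder path

    // Read with any old-API InputFormat via hadoopRDD.
    val input = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text])

    // Run the existing Mapper over each partition, buffering its output.
    // JobConf isn't java-serializable, so build one inside the closure
    // (or ship the driver's conf wrapped in a SerializableWritable).
    val mapped = input.mapPartitions { records =>
      val mapper = new WordCountMapper()  // hypothetical existing old-API Mapper
      mapper.configure(new JobConf())
      val out = new ArrayBuffer[(String, Int)]()
      val collector = new OutputCollector[Text, IntWritable] {
        // Copy values out of the Writables, since Hadoop reuses those objects.
        def collect(k: Text, v: IntWritable): Unit = out += ((k.toString, v.get))
      }
      records.foreach { case (k, v) => mapper.map(k, v, collector, Reporter.NULL) }
      mapper.close()
      out.iterator
    }

    // Stand in for the Reducer; here the reduce step is assumed to be a sum.
    mapped.reduceByKey(_ + _).saveAsTextFile("hdfs:///output/path")  // placeholder path
    sc.stop()
  }
}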
We’ve investigated offering the MapReduce API directly, and while it’s possible, one problem is that a lot of MapReduce code isn’t thread-safe. Hadoop runs each task in a separate JVM, while Spark can run multiple tasks concurrently in the same JVM, so some of the existing code in the jobs we tried porting this way broke. But if your code is thread-safe, the approach mentioned above should work pretty well (see the sketch after the quoted message below for the kind of code that breaks).

Matei

On Feb 1, 2014, at 3:57 PM, nileshc <[email protected]> wrote:

> This might seem like a silly question, so please bear with me. I'm not sure
> about it myself, just would like to know if you think it's utterly
> unfeasible or not, and if it's at all worth doing.
>
> Does anyone feel like it'll be a good idea to build some sort of a library
> that allows us to write code for Spark using the usual bloated Hadoop API?
> This is for the people who want to run their existing MapReduce code (with
> NIL or minimal adjustments) with Spark to take advantage of its speed and
> its better support for iterative workflows.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Hadoop-MapReduce-on-Spark-tp1110.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
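To make the thread-safety point concrete, here is a small illustrative sketch (the LogLineMapper name, its shared SimpleDateFormat, and the assumed yyyy-MM-dd prefix on each log line are all made up). With Hadoop's one-task-per-JVM model this runs fine; with several Spark tasks in one executor JVM, the shared formatter gets mutated concurrently and produces garbage.

import java.text.SimpleDateFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{MapReduceBase, Mapper, OutputCollector, Reporter}

object LogLineMapper {
  // Shared, mutable, and not thread-safe: harmless when each task has its
  // own JVM, but racy when Spark runs several tasks in the same JVM.
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd")
}

class LogLineMapper extends MapReduceBase
    with Mapper[LongWritable, Text, Text, Text] {
  override def map(key: LongWritable, value: Text,
                   out: OutputCollector[Text, Text], reporter: Reporter): Unit = {
    // Assumes each line starts with a yyyy-MM-dd timestamp (illustrative only).
    // Concurrent parse() calls on the shared SimpleDateFormat corrupt its state.
    val day = LogLineMapper.dateFormat.parse(value.toString.take(10))
    out.collect(new Text(LogLineMapper.dateFormat.format(day)), value)
  }
}

// A thread-safe version keeps the formatter per instance (or in a ThreadLocal);
// with the mapPartitions approach above, each task builds its own Mapper anyway.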
