Most of the use cases fall into two categories:

1. Pre-processing over TB/PB-scale data, where the data is larger than the total RAM available on the cluster. Given the maturity of MapReduce, a DAG-based job scheduling framework running on top of it (Scalding/Cascading or Scrunch/Crunch) gives you the power to write code at a higher level of abstraction, as Sean mentioned. Since you are shuffling intermediate results to disk anyway, I don't see much difference between MapReduce and Spark pipelines here.
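To make point 1 concrete, here is a rough sketch of the kind of disk-bound pre-processing job I mean, written against Scalding's fields API. The class name, paths and tokenization are made up for illustration; the point is just that the groupBy forces a shuffle that spills to disk whether the engine underneath is MapReduce or Spark.

```scala
import com.twitter.scalding._

// Hypothetical pre-processing job: tokenize raw text and count tokens.
// args("input") / args("output") are placeholder paths.
class TokenCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
    .groupBy('word) { _.size }          // the shuffle here goes to disk either way
    .write(Tsv(args("output")))
}
```

The equivalent Spark pipeline is a few lines of flatMap/reduceByKey/saveAsTextFile, but for a single pass over data that doesn't fit in RAM, both end up bound by the same disk shuffle.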
2. Running iterative algorithms over features: here the data has already been cleaned in step 1 and you are running algorithmic analysis, perhaps iterating to convergence of some sort. The MapReduce paradigm was not meant for such tasks, and the same argument applies to distributed graphs and streaming data. This is where Spark starts to shine, because you can take the DAG and mark parts of it, or the whole thing, to be cached in memory. Scalding/Scrunch could also expose an API for in-memory caching of parts of the DAG, but it is not available yet. (A small sketch of that caching pattern is at the bottom of this mail, below the quoted thread.)

To sum up, I think we will need both tools for different use cases until they are merged (?) by a higher abstraction layer (hopefully Scalding/Scrunch!).

On Sat, Feb 1, 2014 at 4:43 PM, Sean Owen <[email protected]> wrote:

> An M/R job is a one-shot job, in itself. Making it iterative is what a
> higher-level controller does, by running it several times and pointing
> it at the right input. That bit isn't part of M/R. So I don't think
> you would accomplish this goal by implementing something *under* the
> M/R API.
>
> M/Rs still get written but I think most people serious about it are
> already using higher-level APIs like Apache Crunch, or Cascading.
>
> For those who haven't seen it, Crunch's abstraction bears a lot of
> resemblance to the Spark model -- handles on remote collections. So,
> *the reverse* of this suggestion (i.e. Spark-ish API on M/R) is
> basically Crunch, or Scrunch if you like Scala.
>
> I know Josh Wills has put work into getting Crunch to operate *on top
> of Spark* even. That might be of interest to the original idea of
> getting a possibly more familiar API, for some current Hadoop devs,
> running on top of Spark. (Josh tells me it also enables a few tricks
> that are hard in Spark.)
>
> --
> Sean Owen | Director, Data Science | London
>
> On Sat, Feb 1, 2014 at 11:57 PM, nileshc <[email protected]> wrote:
> > This might seem like a silly question, so please bear with me. I'm not
> > sure about it myself, just would like to know if you think it's utterly
> > unfeasible or not, and if it's at all worth doing.
> >
> > Does anyone feel like it'll be a good idea to build some sort of a
> > library that allows us to write code for Spark using the usual bloated
> > Hadoop API? This is for the people who want to run their existing
> > MapReduce code (with NIL or minimal adjustments) with Spark to take
> > advantage of its speed and its better support for iterative workflows.
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/Hadoop-MapReduce-on-Spark-tp1110.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
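As promised above, a minimal, self-contained sketch of what I mean in point 2 by marking part of the DAG to be cached. Everything here (the file name, the toy gradient-descent update, the fixed iteration count) is made up for illustration; the point is only that the cached RDD is re-read from memory on every pass instead of being recomputed from disk.

```scala
import org.apache.spark.SparkContext

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "iterative-sketch")

    // Toy data set: each line is "x,y"; fit y ~ w * x by gradient descent.
    // "features.csv" is a placeholder path, not something from this thread.
    val points = sc.textFile("features.csv")
      .map { line => val Array(x, y) = line.split(","); (x.toDouble, y.toDouble) }
      .cache()                                  // this part of the DAG is kept in memory

    val n = points.count()                      // first action materialises and caches the RDD
    var w = 0.0
    for (_ <- 1 to 50) {                        // or iterate until the update falls below some epsilon
      val grad = points.map { case (x, y) => 2 * (w * x - y) * x }.reduce(_ + _) / n
      w -= 0.01 * grad                          // each pass re-reads RAM, not HDFS
    }
    println(s"fitted slope: $w")
    sc.stop()
  }
}
```

With plain M/R (or Scalding/Crunch today), each of those 50 passes would be a separate job re-reading its input from HDFS, which is exactly the gap I was describing.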
