This is a very broad topic, and the discussion can quickly become
subjective. I'll try to stick to my own experiences and observations to
keep this thread useful to those looking for answers.

I have used Hadoop MR (with Hive, the MR Java APIs, Cascading, and
Scalding) as well as Spark (since v0.6) in Scala. I learnt Scala in order
to use Spark. My observations are below.

Spark and Hadoop MR:
1. There doesn't have to be a dichotomy between the Hadoop ecosystem and
Spark, since Spark is a part of it.

2. Spark or Hadoop MR, there is no getting away from learning how
partitioning, input splits, and the shuffle process work. To optimize
performance, troubleshoot, and design software, one must understand these.
I recommend reading the first 6-7 chapters of "Hadoop: The Definitive
Guide" to develop an initial understanding. Indeed, knowing a couple of
divide-and-conquer algorithms is a prerequisite, and I assume everyone on
this mailing list is very familiar :)
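
For a tiny taste of why this matters, here is a sketch in Scala (file
paths and numbers are made up; only the API calls are real): every
key-based operation shuffles data across the cluster, and the number of
partitions is a choice you make, not something that just happens.

  import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("partitioning-demo"))

  // Input splits become the initial partitions of the RDD.
  val pairs = sc.textFile("hdfs:///data/events")
                .map(line => (line.split("\t")(0), 1))

  // reduceByKey triggers a shuffle; here we explicitly choose 64 output
  // partitions instead of accepting the default.
  val counts = pairs.reduceByKey(new HashPartitioner(64), _ + _)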

3. Having used several different APIs and layers of abstraction for Hadoop
MR, my experience progressing from the MR Java API --> Cascading -->
Scalding is that each new API looks "simpler" than the previous one.
However, the Spark API and abstraction have been the simplest of all, not
only for me but also for those I have seen start with either Hadoop MR or
Spark first. Spark is the easiest to get started and become productive
with, with the exception of Hive for those already familiar with SQL.
Spark's ease of use is critical for teams starting out with Big Data.

4. It is also extremely simple to chain multi-stage jobs in Spark; you do
it without even realizing it, simply by operating over RDDs. In Hadoop MR,
one has to wire the stages together explicitly.
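
For example, here is a sketch (reusing the SparkContext sc from the earlier
snippet; paths are illustrative) of what would be two chained MR jobs with
intermediate HDFS output, written as one Spark flow:

  // Each shuffle (reduceByKey, sortByKey) starts a new stage; Spark chains
  // the stages automatically. In Hadoop MR this would be two jobs wired
  // together by hand.
  val topWords = sc.textFile("hdfs:///data/books")
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .map { case (word, count) => (count, word) }
    .sortByKey(ascending = false)
    .take(10)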

5. Spark has built-in support for graph algorithms (including Bulk
Synchronous Parallel (BSP) algorithms, e.g. Pregel), machine learning, and
stream processing. In Hadoop MR you need a separate library/framework for
each, and it is non-trivial to combine several of them in the same
application. This is huge!
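
To illustrate what "in the same application" buys you, here is a sketch
(file paths and parameter values are made up) that feeds a GraphX result
straight into MLlib within a single SparkContext:

  import org.apache.spark.graphx.GraphLoader
  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  // GraphX (Pregel-style BSP under the hood) computes PageRank...
  val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
  val ranks = graph.pageRank(0.001).vertices          // (vertexId, rank)

  // ...and MLlib clusters the vertices by rank. No glue code and no
  // intermediate files between the two libraries.
  val features = ranks.map { case (_, rank) => Vectors.dense(rank) }.cache()
  val model = KMeans.train(features, 3, 20)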

6. In Spark one does have to learn how to configure the memory and other
parameters of the cluster. To be clear, similar parameters exist in MR as
well (e.g. shuffle memory parameters), but you don't *have* to learn to
tune them until your jobs reach larger data sizes. In Spark you learn this
by reading the configuration and tuning documentation, followed by
experimentation. This is an area where Spark could be better.
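
As a starting point, the knobs look like this (the property names below
are real and documented in the configuration guide; the values are purely
illustrative, not recommendations):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("tuned-job")
    .set("spark.executor.memory", "4g")           // heap per executor
    .set("spark.storage.memoryFraction", "0.5")   // share for cached RDDs
    .set("spark.shuffle.memoryFraction", "0.3")   // share for shuffle buffers
    .set("spark.default.parallelism", "128")      // default partition count
  val sc = new SparkContext(conf)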

Java or Scala: I already knew Java, yet I learnt Scala when I came across
Spark. As others have said, you can get started with a little Scala and
learn more as you progress. Once you have used Scala for a few weeks, you
will want to stay with it rather than go back to Java. Scala is arguably
more elegant and less verbose than Java, which translates into higher
developer productivity and more maintainable code.
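
To make the verbosity point concrete, here is word count, the whole job,
in Scala (assuming an existing SparkContext sc; paths are illustrative).
The pre-Java-8 equivalent needs a handful of anonymous Function classes
for the same few lines:

  val counts = sc.textFile("hdfs:///data/input")
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  counts.saveAsTextFile("hdfs:///data/output")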

Myth: Spark is for in-memory processing *only*. This is a common beginner
misunderstanding. Spark reads input from and writes output to disk just
like MR; keeping intermediate data in memory is an opt-in optimization,
and cached data that doesn't fit is spilled to disk or recomputed.
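
The storage levels make this explicit (a sketch, assuming an existing
SparkContext sc; the path is illustrative):

  import org.apache.spark.storage.StorageLevel

  val records = sc.textFile("hdfs:///data/big")   // read from disk, like MR
  records.persist(StorageLevel.MEMORY_AND_DISK)   // opt-in cache, spills to disk
  // ...or StorageLevel.DISK_ONLY to keep the cached copy entirely on disk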

Sanjay: Spark uses the Hadoop API for performing I/O against file systems
(local, HDFS, S3, etc.). Therefore you can use the same Hadoop InputFormat
and RecordReader with Spark that you use with Hadoop for your multi-line
record format. See the SparkContext APIs. Just like with Hadoop, you will
need to make sure that your files are split at record boundaries.
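
For instance, a custom InputFormat written for MR plugs straight in (a
sketch only; MultiLineLogInputFormat stands for your own InputFormat /
RecordReader pair and is not a class that ships with Spark or Hadoop):

  import org.apache.hadoop.io.{LongWritable, Text}

  // Reuse the exact classes you wrote for Hadoop MR.
  val logs = sc.newAPIHadoopFile[LongWritable, Text, MultiLineLogInputFormat](
    "hdfs:///logs/app/*.log")

  // Each value is now one complete multi-line record.
  val errors = logs.map { case (_, record) => record.toString }
                   .filter(_.contains("ERROR"))
  errors.saveAsTextFile("hdfs:///logs/app-errors")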

Hope this is helpful.


On Sun, Nov 23, 2014 at 8:35 AM, Sanjay Subramanian <
sanjaysubraman...@yahoo.com.invalid> wrote:

> I am a newbie to Spark as well. I had been programming Hadoop/Hive/Oozie
> extensively before this. I use Hadoop (Java MR code)/Hive/Impala/Presto on
> a daily basis.
>
> To get myself jumpstarted with Spark I started this GitHub repo, which has
> "IntelliJ-ready-to-run" code (simple examples of join, SparkSQL, etc.),
> and I will keep adding to it. I don't know Scala, and I am learning that
> too to help me use Spark better.
> https://github.com/sanjaysubramanian/msfx_scala.git
>
> Philosophically speaking, it's possibly not a good idea to take an
> either/or approach to technology... it's never going to be either RDBMS or
> NoSQL. (If the Cassandra behind FB shows 100 likes instead of 1000 on your
> photo one day for some reason, you may not be that upset... but if the
> Oracle/DB2 systems behind Wells Fargo show $100 LESS in your account due
> to a database error, you will be PANICking.)
>
>
> So it's the same with Spark or Hadoop. I can speak for myself: I have a
> use case for processing old logs that are multi-line (i.e. they have a
> [begin_timestamp_logid] and an [end_timestamp_logid] with many lines in
> between). In Java Hadoop I created custom RecordReaders to solve this. I
> still don't know how to do this in Spark, so until then I will probably
> run the Hadoop code within Oozie in production.
>
> Also, my current task is evangelizing Big Data at my company. The tech
> people I can educate on Hadoop and Spark, and they will learn it, but not
> the business intelligence analysts. They love SQL, so I have to educate
> them using Hive, Presto, Impala... so the question is: what is your task
> or tasks?
>
>
> Sorry, a long non-technical answer to your question...
>
> Make sense?
>
> sanjay
>
>
>   ------------------------------
>  *From:* Krishna Sankar <ksanka...@gmail.com>
> *To:* Sean Owen <so...@cloudera.com>
> *Cc:* Guillermo Ortiz <konstt2...@gmail.com>; user <user@spark.apache.org>
>
> *Sent:* Saturday, November 22, 2014 4:53 PM
> *Subject:* Re: Spark or MR, Scala or Java?
>
> Adding to already interesting answers:
>
>    - "Is there any case where MR is better than Spark? I don't know what cases
>    I should be used Spark by MR. When is MR faster than Spark?"
>
>
>    - Many. MR would be better (am not saying faster ;o)) for
>
>
>    - Very large dataset,
>    - Multistage map-reduce flows,
>    - Complex map-reduce semantics
>
>
>    - Spark is definitely better for the classic iterative,interactive
>    workloads.
>    - Spark is very effective for implementing the concepts of in-memory
>    datasets & real time analytics
>
>
>    - Take a look at the Lambda architecture
>
>
>    - Also checkout how Ooyala is using Spark in multiple layers &
>    configurations. They also have MR in many places
>    - In our case, we found Spark very effective for ELT - we would have
>    used MR earlier
>
>
>    -  "I know Java, is it worth it to learn Scala for programming to
>    Spark or it's okay just with Java?"
>
>
>    - Java will work fine. Especially when Java 8 becomes the norm, we
>    will get back some of the elegance
>    - I, personally, like Scala & Python lot better than Java. Scala is a
>    lot more elegant, but compilations, IDE integration et al are still clunky
>    - One word of caution - stick with one language as much as
>    possible-shuffling between Java & Scala is not fun
>
> Cheers & HTH
> <k/>
>
>
>
> On Sat, Nov 22, 2014 at 8:26 AM, Sean Owen <so...@cloudera.com> wrote:
>
> MapReduce is simpler and narrower, which also means it is generally
> lighter weight, with less to know and configure, and runs more predictably.
> If you have a job that is truly just a few maps, with maybe one reduce, MR
> will likely be more efficient. Until recently its shuffle has been more
> developed and offers some semantics the Spark shuffle does not.
> I suppose it integrates with tools like Oozie that Spark does not.
> I suggest learning enough Scala to use Spark in Scala. The amount you need
> to know is not large.
> (Mahout's MR-based implementations do not run on Spark and will not; they
> have been removed instead.)
> On Nov 22, 2014 3:36 PM, "Guillermo Ortiz" <konstt2...@gmail.com> wrote:
>
> Hello,
>
> I'm a newbie with Spark but I've been working with Hadoop for a while.
> I have two questions.
>
> Is there any case where MR is better than Spark? I don't know in what
> cases I should use Spark instead of MR. When is MR faster than Spark?
>
> The other question is: I know Java; is it worth it to learn Scala for
> programming Spark, or is it okay with just Java? I have done a little
> piece of code in Java because I feel more confident with it, but it seems
> that I'm missing something.
>
