Re: Spark or MR, Scala or Java?

2014-11-23 Thread Sanjay Subramanian
I am a newbie to Spark as well. I have been programming extensively with
Hadoop/Hive/Oozie before this, and I use Hadoop (Java MR code)/Hive/Impala/Presto
on a daily basis.
To get jumpstarted with Spark I started this GitHub repository with
IntelliJ-ready-to-run code (simple examples of join, Spark SQL etc.) and I will
keep adding to it. I don't know Scala and I am learning that too, to help me
use Spark better.
https://github.com/sanjaysubramanian/msfx_scala.git

Philosophically speaking, it's possibly not a good idea to take an either/or
approach to technology. It's never going to be either RDBMS or NoSQL. (If
the Cassandra behind FB shows 100 fewer likes instead of 1000 on your photo one
day for some reason, you may not be that upset... but if the Oracle/DB2 systems
behind Wells Fargo show $100 LESS in your account due to a database error, you
will be PANIC-ing.)

So it's the same case with Spark or Hadoop. I can speak for myself. I have a
use case for processing old logs that are multiline (i.e. they have a
[begin_timestamp_logid] and an [end_timestamp_logid] with many lines in
between). In Java Hadoop I created custom RecordReaders to solve this. I still
don't know how to do this in Spark. Until then I am possibly going to run the
Hadoop code within Oozie in production.
Also, my current task is evangelizing Big Data at my company. The tech people
I can educate with Hadoop and Spark and they will learn that, but not the
business intelligence analysts. They love SQL, so I have to educate them using
Hive, Presto, Impala... So the question is: what is your task or tasks?

Sorry, a long non-technical answer to your question...
Make sense?
sanjay
  From: Krishna Sankar ksanka...@gmail.com
 To: Sean Owen so...@cloudera.com 
Cc: Guillermo Ortiz konstt2...@gmail.com; user user@spark.apache.org 
 Sent: Saturday, November 22, 2014 4:53 PM
 Subject: Re: Spark or MR, Scala or Java?
   
Adding to already interesting answers:
   - Is there any case where MR is better than Spark? I don't know in which
     cases I should use Spark rather than MR. When is MR faster than Spark?
      - Many. MR would be better (am not saying faster ;o)) for
         - Very large datasets,
         - Multistage map-reduce flows,
         - Complex map-reduce semantics
      - Spark is definitely better for the classic iterative, interactive
        workloads.
      - Spark is very effective for implementing the concepts of in-memory
        datasets & real time analytics
         - Take a look at the Lambda architecture
      - Also check out how Ooyala is using Spark in multiple layers &
        configurations. They also have MR in many places
      - In our case, we found Spark very effective for ELT - we would have
        used MR earlier
   - I know Java, is it worth it to learn Scala for programming to Spark or
     is it okay just with Java?
      - Java will work fine. Especially when Java 8 becomes the norm, we will
        get back some of the elegance
      - I, personally, like Scala & Python a lot better than Java. Scala is a
        lot more elegant, but compilation, IDE integration et al are still
        clunky
      - One word of caution - stick with one language as much as possible -
        shuffling between Java & Scala is not fun
Cheers & HTH
k/


On Sat, Nov 22, 2014 at 8:26 AM, Sean Owen so...@cloudera.com wrote:

MapReduce is simpler and narrower, which also means it is generally lighter
weight, with less to know and configure, and runs more predictably. If you have
a job that is truly just a few maps, with maybe one reduce, MR will likely be
more efficient. Until recently its shuffle has been more developed and offers
some semantics the Spark shuffle does not.

I suppose it integrates with tools like Oozie, that Spark does not.

I suggest learning enough Scala to use Spark in Scala. The amount you need to
know is not large.

(Mahout MR based implementations do not run on Spark and will not. They have
been removed instead.)

On Nov 22, 2014 3:36 PM, Guillermo Ortiz konstt2...@gmail.com wrote:

Hello,

I'm a newbie with Spark but I've been working with Hadoop for a while.
I have two questions.

Is there any case where MR is better than Spark? I don't know in which
cases I should use Spark rather than MR. When is MR faster than Spark?

The other question is: I know Java, so is it worth learning Scala for
programming with Spark, or is it okay to use just Java? I have written a
little piece of code in Java because I feel more confident with it, but it
seems that I'm missing something.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org






  

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Ashish Rangole







Re: Spark or MR, Scala or Java?

2014-11-23 Thread Ognen Duzlevski
On Sun, Nov 23, 2014 at 1:03 PM, Ashish Rangole arang...@gmail.com wrote:

 Java or Scala : I knew Java already yet I learnt Scala when I came across
 Spark. As others have said, you can get started with a little bit of Scala
 and learn more as you progress. Once you have started using Scala for a few
 weeks you would want to stay with it instead of going back to Java. Scala
 is arguably more elegant and less verbose than Java which translates into
 higher developer productivity and more maintainable code.


Scala is arguably more elegant and less verbose than Java. However, Scala
is also a complex language with a lot of details, tidbits and one-offs
that you just have to remember. It is sometimes difficult to decide
whether what you wrote uses the language features most effectively, or
whether you missed an available feature that could have made the code
better or more concise. For Spark you really do not need to know that much
Scala, but you do need to understand the essence of it.

Thanks for the good discussion! :-)
Ognen


Re: Spark or MR, Scala or Java?

2014-11-23 Thread Krishna Sankar
Good point.
On the positive side, whether we choose the most efficient mechanism in
Scala might not be as important, as the Spark framework mediates the
distributed computation. Even if there is some declarative part in Spark,
we can still choose an inefficient computation path that is not apparent to
the framework.
Cheers
k/
P.S: Now Reply to ALL




Re: Spark or MR, Scala or Java?

2014-11-23 Thread Krishna Sankar
A very timely article
http://rahulkavale.github.io/blog/2014/11/16/scrap-your-map-reduce/
Cheers
k/
P.S: Now reply to ALL.






Re: Spark or MR, Scala or Java?

2014-11-23 Thread Sanjay Subramanian
Thanks a ton Ashish
sanjay
  From: Ashish Rangole arang...@gmail.com
 To: Sanjay Subramanian sanjaysubraman...@yahoo.com 
Cc: Krishna Sankar ksanka...@gmail.com; Sean Owen so...@cloudera.com; 
Guillermo Ortiz konstt2...@gmail.com; user user@spark.apache.org 
 Sent: Sunday, November 23, 2014 11:03 AM
 Subject: Re: Spark or MR, Scala or Java?
   
This being a very broad topic, a discussion can quickly get subjective. I'll 
try not to deviate from my experiences and observations to keep this thread 
useful to those looking for answers.
I have used Hadoop MR (with Hive, MR Java apis, Cascading and Scalding) as well 
as Spark (since v 0.6) in Scala. I learnt Scala for using Spark. My 
observations are below.
Spark and Hadoop MR:
1. There doesn't have to be a dichotomy between the Hadoop ecosystem and Spark,
since Spark is a part of it.
2. Spark or Hadoop MR, there is no getting away from learning how partitioning,
input splits, and the shuffle process work. In order to optimize performance,
troubleshoot and design software one must know these. I recommend reading the
first 6-7 chapters of the book Hadoop: The Definitive Guide to develop an
initial understanding. Indeed, knowing a couple of divide and conquer
algorithms is a pre-requisite and I assume everyone on this mailing list is
very familiar :)
3. Having used a lot of different APIs and layers of abstraction for Hadoop MR,
my experience progressing from MR Java API -> Cascading -> Scalding is that
each new API looks simpler than the previous one. However, the Spark API and
abstraction has been the simplest, not only for me but also for those I have
seen start with Hadoop MR or Spark first. It is easiest to get started and
become productive with Spark, with the exception of Hive for those who are
already familiar with SQL. Spark's ease of use is critical for teams starting
out with Big Data.
4. It is also extremely simple to chain multi-stage jobs in Spark; you do it
without even realizing by operating over RDDs. In Hadoop MR, one has to handle
it explicitly.
5. Spark has built-in support for graph algorithms (including Bulk Synchronous
Parallel (BSP) processing algorithms, e.g. Pregel), Machine Learning and Stream
processing. In Hadoop MR you need a separate library/framework for each, and it
is non-trivial to combine several of these in the same application. This is
huge!
6. In Spark one does have to learn how to configure the memory and other
parameters of their cluster. Just to be clear, similar parameters exist in MR
as well (e.g. shuffle memory parameters) but you don't *have* to learn about
tuning them until you have jobs with larger data sizes. In Spark you learn
this by reading the configuration and tuning documentation followed by
experimentation. This is an area of Spark where things can be better.
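To make point 2 above a little more concrete, here is a minimal sketch of hash partitioning, the step that decides which reducer (or shuffle partition) a key is sent to. The mask-and-modulo expression is the one used by Hadoop's default HashPartitioner; Spark's HashPartitioner follows the same idea with a non-negative modulus. The class and method names below are made up for illustration, and the code is plain Java with no Hadoop or Spark dependency.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: HashPartitionSketch is a hypothetical name, not a Hadoop or Spark class.
final class HashPartitionSketch {

    // Mask off the sign bit so the result is never negative, then take the modulus.
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Distribute keys into numPartitions buckets; equal keys always share a bucket.
    static List<List<String>> partition(List<String> keys, int numPartitions) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partitionFor(key, numPartitions)).add(key);
        }
        return buckets;
    }
}
```

Because every record with the same key lands in the same partition, a skewed key distribution produces skewed partitions, which is one reason this part is worth understanding before tuning anything else.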
Java or Scala : I knew Java already yet I learnt Scala when I came across 
Spark. As others have said, you can get started with a little bit of Scala and 
learn more as you progress. Once you have started using Scala for a few weeks 
you would want to stay with it instead of going back to Java. Scala is arguably 
more elegant and less verbose than Java which translates into higher developer 
productivity and more maintainable code.
Myth: Spark is for in-memory processing *only*. This is a common beginner 
misunderstanding.
Sanjay: Spark uses Hadoop API for performing I/O from file systems (local, 
HDFS, S3 etc). Therefore you can use the same Hadoop InputFormat and 
RecordReader with Spark that you use with Hadoop for your multi-line record 
format. See SparkContext APIs. Just like Hadoop, you will need to make sure 
that your files are split at record boundaries.
Hope this is helpful.
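As a rough illustration of the multi-line record case discussed above: an existing Hadoop InputFormat/RecordReader can be passed to Spark through the SparkContext hadoopFile / newAPIHadoopFile methods, so the grouping logic itself does not need to change. The sketch below shows only that grouping logic in plain Java, with no Hadoop or Spark dependency. The class name is hypothetical, and the "[begin_" / "[end_" prefixes are an assumption based on the marker names mentioned in the thread; the real log format may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: groups raw log lines into multi-line records.
// Assumes each record starts with a line beginning "[begin_" and ends with a
// line beginning "[end_"; these prefixes are guesses, not a known format.
final class MultiLineRecords {

    static List<String> group(Iterable<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null; // null while we are outside a record
        for (String line : lines) {
            if (line.startsWith("[begin_")) {
                current = new StringBuilder(line);   // start a new record
            } else if (current != null) {
                current.append('\n').append(line);   // continue the open record
                if (line.startsWith("[end_")) {
                    records.add(current.toString()); // record is complete
                    current = null;
                }
            }
            // Lines outside any begin/end pair are skipped.
        }
        return records; // a record left open at end of input is dropped
    }
}
```

This only works if a begin/end pair never straddles two input splits, which is the same record-boundary caveat mentioned above.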




RE: Spark or MR, Scala or Java?

2014-11-22 Thread Ashic Mahtab
Spark can do Map Reduce and more, and faster.
One area where using MR would make sense is if you're using something (maybe 
like Mahout) that doesn't understand Spark yet (Mahout may be Spark compatible 
now...just pulled that name out of thin air!).
You *can* use Spark from Java, but you'd have a MUCH better time using Scala. 
You don't necessarily need to know heaps of Scala to get stuff done in Spark. 
I'm not from a JVM background, having been in the .NET world for most of my 
career, and I haven't found Scala at all difficult. And considering the amount
of stuff in Spark that's built on or uses Scala, it'll always be first class.
If you write Spark stuff in Java, you'll a) need a LOT more code, and b) have
to deal with Spark bridging classes that are provided to overcome
deficiencies in Java.
Hope that helps.
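To give a feel for the "a LOT more code" point, here is a small sketch of the same one-line function written first as an anonymous inner class (the style Java 7 and earlier force on you) and then as a Java 8 lambda. It uses java.util.function.Function purely as a stand-in; Spark's Java API has its own function interfaces in org.apache.spark.api.java.function, which are not used here so that the example runs without Spark. The class and field names are hypothetical.

```java
import java.util.function.Function;

// Illustrative only: contrasts anonymous-class and lambda syntax for one function.
final class LambdaVsAnonymous {

    // Anonymous inner class: several lines of ceremony around one expression.
    static final Function<String, Integer> LENGTH_VERBOSE = new Function<String, Integer>() {
        @Override
        public Integer apply(String s) {
            return s.length();
        }
    };

    // Java 8 lambda: the same behavior in a single line.
    static final Function<String, Integer> LENGTH_LAMBDA = s -> s.length();
}
```

Both fields behave identically; the difference is only in how much code surrounds the actual logic, and it adds up quickly when a job chains many such functions.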
 Date: Sat, 22 Nov 2014 16:34:04 +0100
 Subject: Spark or MR, Scala or Java?
 From: konstt2...@gmail.com
 To: user@spark.apache.org
 
 Hello,
 
 I'm a newbie with Spark but I've been working with Hadoop for a while.
 I have two questions.
 
 Is there any case where MR is better than Spark? I don't know in which
 cases I should use Spark rather than MR. When is MR faster than Spark?
 
 The other question is: I know Java, so is it worth learning Scala for
 programming with Spark, or is it okay to use just Java? I have written a
 little piece of code in Java because I feel more confident with it, but it
 seems that I'm missing something.
 
 
  

Re: Spark or MR, Scala or Java?

2014-11-22 Thread Denny Lee
Just to add some more stuff - there are various scenarios where traditional
Hadoop makes more sense than Spark. For example, if you have a long running
processing job in which you do not want to utilize too many resources of
the cluster. Another example could be that you want to run a distributed
extraction job against multiple data sources via Hadoop streaming.

Another good callout about utilizing Scala within Spark is that most of the
Spark code is written in Scala.



Re: Spark or MR, Scala or Java?

2014-11-22 Thread Sean Owen
MapReduce is simpler and narrower, which also means it is generally lighter
weight, with less to know and configure, and runs more predictably. If you
have a job that is truly just a few maps, with maybe one reduce, MR will
likely be more efficient. Until recently its shuffle has been more
developed and offers some semantics the Spark shuffle does not.

I suppose it integrates with tools like Oozie, that Spark does not.

I suggest learning enough Scala to use Spark in Scala. The amount you need
to know is not large.

(Mahout MR based implementations do not run on Spark and will not. They
have been removed instead.)




Re: Spark or MR, Scala or Java?

2014-11-22 Thread Krishna Sankar
Adding to already interesting answers:

   - Is there any case where MR is better than Spark? I don't know in which
     cases I should use Spark rather than MR. When is MR faster than Spark?
      - Many. MR would be better (am not saying faster ;o)) for
         - Very large datasets,
         - Multistage map-reduce flows,
         - Complex map-reduce semantics
      - Spark is definitely better for the classic iterative, interactive
        workloads.
      - Spark is very effective for implementing the concepts of in-memory
        datasets & real time analytics
         - Take a look at the Lambda architecture
      - Also check out how Ooyala is using Spark in multiple layers &
        configurations. They also have MR in many places
      - In our case, we found Spark very effective for ELT - we would have
        used MR earlier
   - I know Java, is it worth it to learn Scala for programming to Spark or
     is it okay just with Java?
      - Java will work fine. Especially when Java 8 becomes the norm, we will
        get back some of the elegance
      - I, personally, like Scala & Python a lot better than Java. Scala is a
        lot more elegant, but compilation, IDE integration et al are still
        clunky
      - One word of caution - stick with one language as much as possible -
        shuffling between Java & Scala is not fun

Cheers & HTH
k/





Re: Spark or MR, Scala or Java?

2014-11-22 Thread Soumya Simanta
Thanks Sean.

adding user@spark.apache.org again.

On Sat, Nov 22, 2014 at 9:35 PM, Sean Owen so...@cloudera.com wrote:

 On Sun, Nov 23, 2014 at 2:20 AM, Soumya Simanta
 soumya.sima...@gmail.com wrote:
  Is the MapReduce API simpler or the implementation? Almost every Spark
  presentation has a slide that shows 100+ lines of Hadoop MR code in Java
  and the same feature implemented in 3 lines of Scala code on Spark. So the
  Spark API is certainly simpler, at least based on what I know. What am I
  missing here?

 The implementation is simpler. The API is not. However I don't think
 anyone 'really' uses the M/R API directly now. They use Crunch or
 maybe Cascading. These are also much less than 100 lines for word
 count, on top of M/R.
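For a sense of scale, here is a minimal word count sketch in plain Java 8 streams. It is not Spark, Crunch or Cascading code; it only illustrates how short the logic becomes once the API offers flatMap-style and group/count-style operations, which is roughly what the Spark version does with flatMap, a map to (word, 1) pairs, and reduceByKey. The class name is hypothetical, and a TreeMap is used only so the result has a stable key order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: single-JVM word count, not a distributed job.
final class WordCountSketch {

    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // split each line on whitespace and flatten into one stream of words
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                // blank lines produce one empty token; drop it
                .filter(word -> !word.isEmpty())
                // group identical words and count the occurrences of each
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }
}
```

The distributed versions have to deal with input splits, shuffling and output formats on top of this, which is where most of the extra lines in a raw M/R job come from.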

  Can you please expand on what you mean by efficient? Better performance
  and/or reliability, fewer resources or something else?

 All of the above. Map/Reduce is simple and easy to understand, and
 Spark is actually hard to reason about, and heavy-weight. Of course,
 as soon as your work spans more than one MapReduce, this reasoning
 changes a lot. But MapReduce is better for truly map-only, or
 map-with-a-reduce-only, workloads. It is optimized for this case. The
 shuffle is still better.