Is Spark in Java a bad idea?

2014-10-28 Thread Ron Ayoub
I haven't learned Scala yet, so as you might imagine I'm having challenges 
working with Spark from the Java API. For one thing, it seems very limited in 
comparison to Scala. I ran into a problem really quickly. I need to hydrate an 
RDD from JDBC/Oracle, so I wanted to use JdbcRDD. But that class is part of 
the Scala-facing Spark API, and I'm unable to get the compiler to accept the 
various parameters. I looked at the code and noticed that JdbcRDD doesn't add 
much value; it just implements compute and getPartitions. I figured I could do 
that myself with better-looking JDBC code. So I created a class inheriting 
from RDD, which turned out to be heavily decorated with stuff I have never 
seen before. Next, I recalled that I have to use JavaRDD, and of course that 
class doesn't have those overridable methods. 
From where I'm standing right now, it appears that Spark doesn't really 
support Java, and that if you want to use it seriously you need to learn 
Scala. Is this a correct assessment? 

  

Re: Is Spark in Java a bad idea?

2014-10-28 Thread critikaled
Hi Ron,
Whatever API you have in Scala, you can most likely use from Java; Scala is
inter-operable with Java and vice versa. Scala, being both object-oriented
and functional, will make your job easier on the JVM, and it is more concise
than Java. Take it as an opportunity and start learning Scala ;).




Re: Is Spark in Java a bad idea?

2014-10-28 Thread Matei Zaharia
A pretty large fraction of users use Java, but a few features are still not 
available in it. JdbcRDD is one of them -- this functionality will likely be 
superseded by Spark SQL when we add JDBC as a data source. In the meantime, to 
use it, I'd recommend writing a class in Scala that has Java-friendly methods 
and getting an RDD from it. Basically, the two parameters that weren't 
friendly there were the ClassTag and the getConnection and mapRow functions.
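
[Editor's note: for illustration, here is a rough sketch of what calling 
JdbcRDD directly from Java looks like under the Spark 1.1 constructor as 
described above, showing why those parameters are painful: the Scala function 
arguments have to be emulated with scala.runtime.AbstractFunction0/1 
subclasses (made Serializable so Spark can ship them to executors), and the 
implicit ClassTag must be passed explicitly. The class names, JDBC URL, 
table, and query below are placeholder assumptions, not part of Spark.]

import java.io.Serializable;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import org.apache.spark.SparkContext;
import org.apache.spark.rdd.JdbcRDD;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;
import scala.runtime.AbstractFunction0;
import scala.runtime.AbstractFunction1;

public class DirectJdbcFromJava {
  // Spark serializes these functions and ships them to executors,
  // so the Scala function stubs must also implement Serializable.
  static class ConnFn extends AbstractFunction0<Connection> implements Serializable {
    @Override public Connection apply() {
      try { return DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/svc"); } // placeholder URL
      catch (Exception e) { throw new RuntimeException(e); }
    }
  }

  static class RowFn extends AbstractFunction1<ResultSet, String> implements Serializable {
    @Override public String apply(ResultSet rs) {
      try { return rs.getString(1); }
      catch (Exception e) { throw new RuntimeException(e); }
    }
  }

  public static void main(String[] args) {
    SparkContext sc = new SparkContext("local[2]", "jdbc-from-java");
    // Scala's implicit ClassTag[T] surfaces as an explicit trailing constructor argument in Java.
    ClassTag<String> tag = ClassTag$.MODULE$.apply(String.class);
    JdbcRDD<String> rdd = new JdbcRDD<String>(
        sc,
        new ConnFn(),
        "SELECT name FROM people WHERE id >= ? AND id <= ?", // the two ?s are the partition bounds
        1L, 1000000L, 10,                                    // lowerBound, upperBound, numPartitions
        new RowFn(),
        tag);
    System.out.println(rdd.count());
  }
}

[The Scala-wrapper approach suggested above would hide exactly this 
boilerplate behind a plain, Java-friendly method signature.]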

Subclassing RDD in Java is also not really supported, because that's an 
internal API. We don't expect users to be defining their own RDDs.

Matei



RE: Is Spark in Java a bad idea?

2014-10-28 Thread Ron Ayoub
I interpret this to mean you have to learn Scala in order to work with Spark in 
Scala (which goes without saying) and also to work with Spark in Java (since 
you have to jump through some hoops for basic functionality).
The best path here is to take this as a learning opportunity and sit down and 
learn Scala. 
Regarding RDD being an internal API: it has two methods that are clearly meant 
to be overridden, which JdbcRDD does, and it looks close to trivial -- if only 
I knew Scala. Once I learn Scala, the first thing I plan on doing is writing 
my own OracleRDD with my own flavor of JDBC code. Why would this not be 
advisable? 

Re: Is Spark in Java a bad idea?

2014-10-28 Thread Matei Zaharia
The overridable methods of RDD are marked as @DeveloperApi, which means that 
they are internal APIs used by people who might want to extend Spark, but are 
not guaranteed to remain stable across Spark versions (unlike Spark's public 
APIs).

BTW, if you want a way to do this that does not involve JdbcRDD or internal 
APIs, you can use SparkContext.parallelize followed by mapPartitions to read a 
subset of the data in each of your tasks. That can be done purely in Java. 
You'd probably parallelize a collection that contains ranges of the table that 
you want to scan, then open a connection to the DB in each task (in 
mapPartitions) and read the records from that range.
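
[Editor's note: a minimal Java sketch of that parallelize-plus-mapPartitions 
approach, under some assumptions: the Spark 1.x Java API (where 
FlatMapFunction.call returns an Iterable), and a placeholder Oracle JDBC URL, 
table (people), key column (id), and range sizes.]

import java.io.Serializable;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public class JdbcViaMapPartitions {
  // One id range of the table per task.
  static class Range implements Serializable {
    final long lower, upper;
    Range(long lower, long upper) { this.lower = lower; this.upper = upper; }
  }

  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[4]", "jdbc-via-mappartitions");

    // Split the table's key space into ranges; one range per partition.
    List<Range> ranges = new ArrayList<Range>();
    long step = 100000L;
    for (long lo = 1L; lo <= 1000000L; lo += step) {
      ranges.add(new Range(lo, lo + step - 1));
    }

    JavaRDD<String> rows = sc.parallelize(ranges, ranges.size()).mapPartitions(
        new FlatMapFunction<Iterator<Range>, String>() {
          // Each task opens its own connection and reads only its range(s).
          @Override public Iterable<String> call(Iterator<Range> it) throws Exception {
            List<String> out = new ArrayList<String>();
            Connection conn = DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/svc"); // placeholder URL
            try {
              while (it.hasNext()) {
                Range r = it.next();
                PreparedStatement ps = conn.prepareStatement(
                    "SELECT name FROM people WHERE id >= ? AND id <= ?");
                ps.setLong(1, r.lower);
                ps.setLong(2, r.upper);
                ResultSet rs = ps.executeQuery();
                while (rs.next()) { out.add(rs.getString(1)); }
                rs.close();
                ps.close();
              }
            } finally {
              conn.close();
            }
            return out;
          }
        });

    System.out.println(rows.count());
  }
}

[Materializing each range into a list keeps the sketch simple; for very large 
ranges you would want to stream the ResultSet instead of buffering it.]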

Matei




Re: Is Spark in Java a bad idea?

2014-10-28 Thread Mark Hamstra
I believe that you are overstating your case.

If you want to work with Spark, then the Java API is entirely adequate, with
very few exceptions -- unfortunately, though, one of those exceptions is
something that you are interested in, JdbcRDD.

If you want to work on Spark -- customizing, extending, or contributing to
it -- then working in Scala is pretty much unavoidable if your work is of any
significant depth.

That being said, I expect that there are very few Spark users who are
comfortable with the Scala API who would voluntarily choose to regularly
use the Java or Python APIs, so taking the opportunity to learn Scala isn't
a bad thing.




Re: Is Spark in Java a bad idea?

2014-10-28 Thread Kevin Markey

  
  
Don't be too concerned about the Scala hoop.  Before making the commitment to
Scala, I had coded up a modest analytic prototype in Hadoop MapReduce.  Once I
made the commitment, it took 10 days to (1) learn enough Scala, and (2)
rewrite the prototype in Spark in Scala.  In doing so, the execution time for
the prototype was cut to about 1/8, and the lines of code for identical
functionality were about 1/10.

A few things helped me:

- Martin Odersky's "Programming in Scala".  No need to read the whole thing,
but use it as a reference together with the course.
- His "Functional Programming Principles in Scala" on Coursera.  It's not
necessary to enroll in a concurrent course; "enroll" in a past course, watch
the videos, and do a few exercises.
https://class.coursera.org/progfun-003
- The cheat sheets on the Scala website.
http://docs.scala-lang.org/cheatsheets/
- Example code in Spark.  Plenty of it to go around.

Once you have experienced the glories of Scala, there's no turning
back.  It is a computer science cornucopia!

Kevin



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org