Is Spark in Java a bad idea?
I haven't learned Scala yet, so as you might imagine I'm having challenges working with Spark from the Java API. For one thing, it seems very limited compared to the Scala API, and I ran into a problem very quickly. I need to hydrate an RDD from JDBC/Oracle, so I wanted to use JdbcRDD. But that class is part of the Scala API, and I couldn't get the compiler to accept the various parameters. I looked at the source and noticed that JdbcRDD doesn't add much value; it just implements getPartitions and compute. I figured I could do that myself with better-looking JDBC code, so I created a class inheriting from RDD, a class that turns out to be heavily decorated with Scala constructs I have never seen before. Then I remembered that from Java I have to use JavaRDD, and of course that class doesn't expose those methods to override. From where I'm standing right now, it appears that Spark doesn't really support Java, and that if you want to use it seriously you need to learn Scala. Is this a correct assessment?
Re: Is Spark in Java a bad idea?
Hi Ron, whatever API exists in Scala, you can generally use from Java; Scala is interoperable with Java and vice versa. Scala being both object-oriented and functional will make your job on the JVM easier, and it is more concise than Java. Take it as an opportunity and start learning Scala ;).
Re: Is Spark in Java a bad idea?
A pretty large fraction of users use Java, but a few features are still not available in it. JdbcRDD is one of them; this functionality will likely be superseded by Spark SQL once we add JDBC as a data source. In the meantime, to use it, I'd recommend writing a class in Scala that exposes Java-friendly methods and getting an RDD from that. Basically, the two parameters that weren't Java-friendly there were the ClassTag and the getConnection and mapRow functions. Subclassing RDD in Java is also not really supported, because that's an internal API; we don't expect users to be defining their own RDDs. Matei
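(For reference, here is roughly what calling the Scala JdbcRDD directly from Java involves. This is an untested sketch against the Spark 1.1-era constructor; the connection URL, table, and column names are placeholders, and the SerFunction helper classes exist only because Scala's AbstractFunction0/1 are not Serializable. It mostly illustrates why a Scala wrapper with Java-friendly methods is the friendlier route.)

import java.io.Serializable;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.rdd.JdbcRDD;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;
import scala.runtime.AbstractFunction0;
import scala.runtime.AbstractFunction1;

public class JdbcFromJava {
  // Scala's AbstractFunction0/1 are not Serializable, so Spark tasks need these wrappers.
  static abstract class SerFunction0<R> extends AbstractFunction0<R> implements Serializable {}
  static abstract class SerFunction1<T, R> extends AbstractFunction1<T, R> implements Serializable {}

  public static JavaRDD<String> load(SparkContext sc) {
    // getConnection: opened once per partition by JdbcRDD (URL/credentials are placeholders).
    SerFunction0<Connection> getConnection = new SerFunction0<Connection>() {
      @Override public Connection apply() {
        try {
          return DriverManager.getConnection(
              "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
        } catch (Exception e) { throw new RuntimeException(e); }
      }
    };
    // mapRow: turns each ResultSet row into an RDD element.
    SerFunction1<ResultSet, String> mapRow = new SerFunction1<ResultSet, String>() {
      @Override public String apply(ResultSet rs) {
        try { return rs.getLong(1) + "," + rs.getString(2); }
        catch (Exception e) { throw new RuntimeException(e); }
      }
    };
    // The implicit ClassTag becomes an explicit trailing constructor argument from Java.
    ClassTag<String> tag = ClassTag$.MODULE$.apply(String.class);
    JdbcRDD<String> rdd = new JdbcRDD<String>(
        sc, getConnection,
        "SELECT id, name FROM my_table WHERE id >= ? AND id <= ?", // exactly two '?' bounds
        1L, 1000000L, 10, mapRow, tag);
    return new JavaRDD<String>(rdd, tag);
  }
}

Everything from the SerFunction wrappers down to the explicit ClassTag is boilerplate you would not have to write in Scala.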
RE: Is Spark in Java a bad idea?
I interpret this to mean you have to learn Scala to work with Spark in Scala (that goes without saying), and also to work with Spark in Java, since you have to jump through some hoops for basic functionality. The best path here is to take this as a learning opportunity and sit down and learn Scala. Regarding RDD being an internal API: it has two methods that are clearly meant to be overridden, which JdbcRDD does, and doing so looks close to trivial, if only I knew Scala. Once I learn Scala, the first thing I plan on doing is writing my own OracleRDD with my own flavor of JDBC code. Why would this not be advisable?
Re: Is Spark in Java a bad idea?
The overridable methods of RDD are marked @DeveloperApi, which means they are internal APIs intended for people who want to extend Spark, and they are not guaranteed to remain stable across Spark versions (unlike Spark's public APIs). BTW, if you want a way to do this that does not involve JdbcRDD or internal APIs, you can use SparkContext.parallelize followed by mapPartitions to read a subset of the data in each of your tasks. That can be done purely in Java. You'd parallelize a collection containing the ranges of the table you want to scan, then open a connection to the database in each task (inside mapPartitions) and read the records from that range. Matei
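(For reference, a minimal sketch of that parallelize-plus-mapPartitions approach in Java. Untested, written against the Spark 1.x Java API where FlatMapFunction.call returns an Iterable; the JDBC URL, credentials, table, and column names are placeholders, and the Oracle driver would need to be on the executor classpath.)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public class JdbcRanges {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[*]", "jdbc-ranges");

    // One element per partition: the primary-key range that a task should scan.
    List<long[]> ranges = Arrays.asList(
        new long[]{1L, 100000L},
        new long[]{100001L, 200000L},
        new long[]{200001L, 300000L});

    JavaRDD<String> rows = sc.parallelize(ranges, ranges.size()).mapPartitions(
        new FlatMapFunction<Iterator<long[]>, String>() {
          @Override
          public Iterable<String> call(Iterator<long[]> it) throws Exception {
            List<String> out = new ArrayList<String>();
            // Open one connection per task rather than per record.
            Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
            try {
              while (it.hasNext()) {
                long[] range = it.next();
                PreparedStatement ps = conn.prepareStatement(
                    "SELECT id, name FROM my_table WHERE id BETWEEN ? AND ?");
                ps.setLong(1, range[0]);
                ps.setLong(2, range[1]);
                ResultSet rs = ps.executeQuery();
                while (rs.next()) {
                  out.add(rs.getLong(1) + "," + rs.getString(2));
                }
                rs.close();
                ps.close();
              }
            } finally {
              conn.close();
            }
            return out;
          }
        });

    System.out.println("rows read: " + rows.count());
    sc.stop();
  }
}

The only Spark-specific decision is how to split the key space into ranges; everything inside the task is plain JDBC.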
Re: Is Spark in Java a bad idea?
I believe you are overstating your case. If you want to work with Spark (that is, use it), the Java API is entirely adequate with very few exceptions; unfortunately, one of those exceptions is something you are interested in, JdbcRDD. If you want to work on Spark itself (customizing, extending, or contributing to it), then working in Scala is pretty much unavoidable if your work is of any significant depth. That said, I expect there are very few Spark users comfortable with the Scala API who would voluntarily choose to use the Java or Python APIs regularly, so taking the opportunity to learn Scala isn't a bad thing.
Re: Is Spark in Java a bad idea?
Don't be too concerned about the Scala hoop. Before making the commitment to Scala, I had coded up a modest analytic prototype in Hadoop MapReduce. Once I made the commitment, it took ten days to (1) learn enough Scala and (2) rewrite the prototype in Spark in Scala. In doing so, the execution time for the prototype was cut to about 1/8, and the line count for identical functionality dropped to about 1/10. A few things helped me:
- Martin Odersky's "Programming in Scala". No need to read the whole thing; use it as a reference alongside the course.
- His "Functional Programming Principles in Scala" course on Coursera. You don't need to enroll in a currently running session; "enroll" in a past one, watch the videos, and do a few exercises. https://class.coursera.org/progfun-003
- The cheat sheets on the Scala website. http://docs.scala-lang.org/cheatsheets/
- Example code in Spark. There's plenty of it to go around.
Once you have experienced the glories of Scala, there's no turning back. It is a computer science cornucopia! Kevin