That's exactly what I was doing. However, I ran into runtime issues with doing that. For instance, I had a
public class DbConnection extends AbstractFunction0<Connection> implements Serializable I got a runtime error from Spark complaining that DbConnection wasn't an instance of scala.Function0. I also had a public class MapRow extends scala.runtime.AbstractFunction1<java.sql.ResultSet, Row> implements Serializable with which I seemed to have more luck. On Wed, Feb 18, 2015 at 5:32 PM, Cody Koeninger <[email protected]> wrote: > Cant you implement the > > org.apache.spark.api.java.function.Function > > interface and pass an instance of that to JdbcRDD.create ? > > On Wed, Feb 18, 2015 at 3:48 PM, Dmitry Goldenberg < > [email protected]> wrote: > >> Cody, you were right, I had a copy and paste snag where I ended up with a >> vanilla SparkContext rather than a Java one. I also had to *not* use my >> function subclasses, rather just use anonymous inner classes for the >> Function stuff and that got things working. I'm fully following >> the JdbcRDD.create approach from JavaJdbcRDDSuite.java basically verbatim. >> >> Is there a clean way to refactor out the custom Function classes such as >> the one for getting a db connection or mapping ResultSet data to your own >> POJO's rather than doing it all inline? >> >> >> On Wed, Feb 18, 2015 at 1:52 PM, Cody Koeninger <[email protected]> >> wrote: >> >>> Is sc there a SparkContext or a JavaSparkContext? The compilation error >>> seems to indicate the former, but JdbcRDD.create expects the latter >>> >>> On Wed, Feb 18, 2015 at 12:30 PM, Dmitry Goldenberg < >>> [email protected]> wrote: >>> >>>> I have tried that as well, I get a compile error -- >>>> >>>> [ERROR] ...SparkProto.java:[105,39] error: no suitable method found for >>>> create(SparkContext,<anonymous >>>> ConnectionFactory>,String,int,int,int,<anonymous >>>> Function<ResultSet,Integer>>) >>>> >>>> The code is a copy and paste: >>>> >>>> JavaRDD<Integer> jdbcRDD = JdbcRDD.create( >>>> sc, >>>> new JdbcRDD.ConnectionFactory() { >>>> public Connection getConnection() throws SQLException { >>>> return >>>> DriverManager.getConnection("jdbc:derby:target/JavaJdbcRDDSuiteDb"); >>>> } >>>> }, >>>> "SELECT DATA FROM FOO WHERE ? <= ID AND ID <= ?", >>>> 1, 100, 1, >>>> new Function<ResultSet, Integer>() { >>>> public Integer call(ResultSet r) throws Exception { >>>> return r.getInt(1); >>>> } >>>> } >>>> ); >>>> >>>> The other thing I've tried was to define a static class locally for >>>> GetConnection and use the JdbcCreate constructor. This got around the >>>> compile issues but blew up at runtime with "NoClassDefFoundError: >>>> scala/runtime/AbstractFunction0" ! >>>> >>>> JdbcRDD<Row> jdbcRDD = new JdbcRDD<Row>( >>>> sc, >>>> (AbstractFunction0<Connection>) new DbConn(), // had to cast or a >>>> compile error >>>> SQL_QUERY, >>>> 0L, >>>> 1000L, >>>> 10, >>>> new MapRow(), >>>> ROW_CLASS_TAG); >>>> // DbConn is defined as public static class DbConn extends >>>> AbstractFunction0<Connection> implements Serializable >>>> >>>> On Wed, Feb 18, 2015 at 1:20 PM, Cody Koeninger <[email protected]> >>>> wrote: >>>> >>>>> That test I linked >>>>> >>>>> >>>>> https://github.com/apache/spark/blob/v1.2.1/core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java#L90 >>>>> >>>>> is calling a static method JdbcRDD.create, not new JdbcRDD. Is that >>>>> what you tried doing? >>>>> >>>>> On Wed, Feb 18, 2015 at 12:00 PM, Dmitry Goldenberg < >>>>> [email protected]> wrote: >>>>> >>>>>> Thanks, Cody. Yes, I originally started off by looking at that but I >>>>>> get a compile error if I try and use that approach: constructor JdbcRDD >>>>>> in >>>>>> class JdbcRDD<T> cannot be applied to given types. Not to mention that >>>>>> JavaJdbcRDDSuite somehow manages to not pass in the class tag (the last >>>>>> argument). >>>>>> >>>>>> Wonder if it's a JDK version issue, I'm using 1.7. >>>>>> >>>>>> So I've got this, which doesn't compile >>>>>> >>>>>> JdbcRDD<Row> jdbcRDD = new JdbcRDD<Row>( >>>>>> new SparkContext(conf), >>>>>> new JdbcRDD.ConnectionFactory() { >>>>>> public Connection getConnection() throws SQLException { >>>>>> Connection conn = null; >>>>>> try { >>>>>> Class.forName(JDBC_DRIVER); >>>>>> conn = DriverManager.getConnection(JDBC_URL, JDBC_USER, >>>>>> JDBC_PASSWORD); >>>>>> } catch (ClassNotFoundException ex) { >>>>>> throw new RuntimeException("Error while loading JDBC driver.", ex); >>>>>> } >>>>>> return conn; >>>>>> } >>>>>> }, >>>>>> "SELECT * FROM EMPLOYEES", >>>>>> 0L, >>>>>> 1000L, >>>>>> 10, >>>>>> new Function<ResultSet, Row>() { >>>>>> public Row call(ResultSet r) throws Exception { >>>>>> return null; // have some actual logic here... >>>>>> } >>>>>> }, >>>>>> scala.reflect.ClassManifestFactory$.MODULE$.fromClass(Row.class)); >>>>>> >>>>>> The other approach was mimicing the DbConnection class from this >>>>>> post: >>>>>> http://www.sparkexpert.com/2015/01/02/load-database-data-into-spark-using-jdbcrdd-in-java/. >>>>>> It got around any of the compilation issues but then I got the runtime >>>>>> error where Spark wouldn't recognize the db connection class as a >>>>>> scala.Function0. >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Feb 18, 2015 at 12:37 PM, Cody Koeninger <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Take a look at >>>>>>> >>>>>>> >>>>>>> https://github.com/apache/spark/blob/v1.2.1/core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Feb 18, 2015 at 11:14 AM, dgoldenberg < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> I'm reading data from a database using JdbcRDD, in Java, and I have >>>>>>>> an >>>>>>>> implementation of Function0<Connection> whose instance I supply as >>>>>>>> the >>>>>>>> 'getConnection' parameter into the JdbcRDD constructor. Compiles >>>>>>>> fine. >>>>>>>> >>>>>>>> The definition of the class/function is as follows: >>>>>>>> >>>>>>>> public class GetDbConnection extends AbstractFunction0<Connection> >>>>>>>> implements Serializable >>>>>>>> >>>>>>>> where scala.runtime.AbstractFunction0 extends scala.Function0. >>>>>>>> >>>>>>>> At runtime, I get an exception as below. Does anyone have an idea >>>>>>>> as to how >>>>>>>> to resolve this/work around it? Thanks. >>>>>>>> >>>>>>>> I'm running Spark 1.2.1 built for Hadoop 2.4. >>>>>>>> >>>>>>>> >>>>>>>> Exception in thread "main" org.apache.spark.SparkException: Job >>>>>>>> aborted due >>>>>>>> to stage failure: Task 3 in stage 0.0 failed 1 times, most recent >>>>>>>> failure: >>>>>>>> Lost task 3.0 in stage 0.0 (TID 3, localhost): >>>>>>>> java.lang.ClassCastException: >>>>>>>> cannot assign instance of >>>>>>>> com.kona.motivis.spark.proto.GetDbConnection to >>>>>>>> field >>>>>>>> org.apache.spark.rdd.JdbcRDD.org$apache$spark$rdd$JdbcRDD$$getConnection >>>>>>>> of >>>>>>>> type scala.Function0 in instance of org.apache.spark.rdd.JdbcRDD >>>>>>>> at >>>>>>>> >>>>>>>> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083) >>>>>>>> at >>>>>>>> >>>>>>>> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261) >>>>>>>> at >>>>>>>> >>>>>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996) >>>>>>>> at >>>>>>>> >>>>>>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) >>>>>>>> at >>>>>>>> >>>>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) >>>>>>>> at >>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) >>>>>>>> at >>>>>>>> >>>>>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) >>>>>>>> at >>>>>>>> >>>>>>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) >>>>>>>> at >>>>>>>> >>>>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) >>>>>>>> at >>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) >>>>>>>> at >>>>>>>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) >>>>>>>> at >>>>>>>> >>>>>>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) >>>>>>>> at >>>>>>>> >>>>>>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) >>>>>>>> at >>>>>>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) >>>>>>>> at org.apache.spark.scheduler.Task.run(Task.scala:56) >>>>>>>> at >>>>>>>> >>>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) >>>>>>>> at >>>>>>>> >>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) >>>>>>>> at >>>>>>>> >>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) >>>>>>>> at java.lang.Thread.run(Thread.java:744) >>>>>>>> >>>>>>>> Driver stacktrace: >>>>>>>> at >>>>>>>> org.apache.spark.scheduler.DAGScheduler.org >>>>>>>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) >>>>>>>> at >>>>>>>> >>>>>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) >>>>>>>> at >>>>>>>> >>>>>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) >>>>>>>> at >>>>>>>> >>>>>>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) >>>>>>>> at >>>>>>>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) >>>>>>>> at >>>>>>>> >>>>>>>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) >>>>>>>> at >>>>>>>> >>>>>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) >>>>>>>> at >>>>>>>> >>>>>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) >>>>>>>> at scala.Option.foreach(Option.scala:236) >>>>>>>> at >>>>>>>> >>>>>>>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696) >>>>>>>> at >>>>>>>> >>>>>>>> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420) >>>>>>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >>>>>>>> at >>>>>>>> >>>>>>>> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375) >>>>>>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >>>>>>>> at akka.actor.ActorCell.invoke(ActorCell.scala:487) >>>>>>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) >>>>>>>> at akka.dispatch.Mailbox.run(Mailbox.scala:220) >>>>>>>> at >>>>>>>> >>>>>>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) >>>>>>>> at >>>>>>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>>>>>>> at >>>>>>>> >>>>>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >>>>>>>> at >>>>>>>> >>>>>>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>>>>>>> at >>>>>>>> >>>>>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> View this message in context: >>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/JdbcRDD-ClassCastException-with-scala-Function0-tp21707.html >>>>>>>> Sent from the Apache Spark User List mailing list archive at >>>>>>>> Nabble.com. >>>>>>>> >>>>>>>> >>>>>>>> --------------------------------------------------------------------- >>>>>>>> To unsubscribe, e-mail: [email protected] >>>>>>>> For additional commands, e-mail: [email protected] >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
