How to join dstream and JDBCRDD with checkpointing enabled

saurabh3d Thu, 03 Nov 2016 01:14:50 -0700

Hi All,

We have a spark streaming job with checkpoint enabled, it executes correctly
first time, but throw below exception when restarted from checkpoint.


org.apache.spark.SparkException: RDD transformations and actions can only be
invoked by the driver, not inside of other transformations; for example,
rdd1.map(x => rdd2.values.count() * x) is invalid because the values
transformation and count action cannot be performed inside of the rdd1.map
transformation. For more information, see SPARK-5063.
        at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:352)
        at org.apache.spark.rdd.RDD.union(RDD.scala:565)
        at
org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:23)
        at
org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:19)
        at
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)

Please suggest any workaround for this issue. 

Code:
        String URL = "jdbc:oracle:thin:" + USERNAME + "/" + PWD + "@//" +
CONNECTION_STRING;

        Map<String, String> options = ImmutableMap.of(
                "driver", "oracle.jdbc.driver.OracleDriver",
                "url", URL,
                "dbtable", "READINGS_10K",
                "fetchSize", "10000");

        DataFrame OracleDB_DF = sqlContext.load("jdbc", options);
                JavaPairRDD<String, Row> OracleDB_RDD = OracleDB_DF.toJavaRDD()
                .mapToPair(x -> new Tuple2(x.getString(0), x));
        
        Dstream
                .transformToPair(
                        rdd -> rdd
                                .mapToPair(
                                        record ->
                                                new Tuple2<>(
                                                       
record.getKey().toString(),
                                                        record))
                                .join(OracleDB_RDD))
                                .print();

Spark version 1.6, running in yarn cluster mode.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-join-dstream-and-JDBCRDD-with-checkpointing-enabled-tp28001.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

How to join dstream and JDBCRDD with checkpointing enabled

Reply via email to